from:"JoonSoo Kim"

Re: [PATCH v2 4/5] mm, highmem: makes flush_all_zero_pkmaps() return index of first flushed entry

2012-11-27 Thread JoonSoo Kim

Hello, Andrew.

2012/11/20 Minchan Kim :
> Hi Joonsoo,
> Sorry for the delay.
>
> On Thu, Nov 15, 2012 at 02:09:04AM +0900, JoonSoo Kim wrote:
>> Hi, Minchan.
>>
>> 2012/11/14 Minchan Kim :
>> > On Tue, Nov 13, 2012 at 11:12:28PM +0900, JoonSoo Kim wrote:
>> >> 2012/11/13 Minchan Kim :
>> >> > On Tue, Nov 13, 2012 at 09:30:57AM +0900, JoonSoo Kim wrote:
>> >> >> 2012/11/3 Minchan Kim :
>> >> >> > Hi Joonsoo,
>> >> >> >
>> >> >> > On Sat, Nov 03, 2012 at 04:07:25AM +0900, JoonSoo Kim wrote:
>> >> >> >> Hello, Minchan.
>> >> >> >>
>> >> >> >> 2012/11/1 Minchan Kim :
>> >> >> >> > On Thu, Nov 01, 2012 at 01:56:36AM +0900, Joonsoo Kim wrote:
>> >> >> >> >> In current code, after flush_all_zero_pkmaps() is invoked,
>> >> >> >> >> then re-iterate all pkmaps. It can be optimized if 
>> >> >> >> >> flush_all_zero_pkmaps()
>> >> >> >> >> return index of first flushed entry. With this index,
>> >> >> >> >> we can immediately map highmem page to virtual address 
>> >> >> >> >> represented by index.
>> >> >> >> >> So change return type of flush_all_zero_pkmaps()
>> >> >> >> >> and return index of first flushed entry.
>> >> >> >> >>
>> >> >> >> >> Additionally, update last_pkmap_nr to this index.
>> >> >> >> >> It is certain that entry which is below this index is occupied 
>> >> >> >> >> by other mapping,
>> >> >> >> >> therefore updating last_pkmap_nr to this index is reasonable 
>> >> >> >> >> optimization.
>> >> >> >> >>
>> >> >> >> >> Cc: Mel Gorman 
>> >> >> >> >> Cc: Peter Zijlstra 
>> >> >> >> >> Cc: Minchan Kim 
>> >> >> >> >> Signed-off-by: Joonsoo Kim 
>> >> >> >> >>
>> >> >> >> >> diff --git a/include/linux/highmem.h b/include/linux/highmem.h
>> >> >> >> >> index ef788b5..97ad208 100644
>> >> >> >> >> --- a/include/linux/highmem.h
>> >> >> >> >> +++ b/include/linux/highmem.h
>> >> >> >> >> @@ -32,6 +32,7 @@ static inline void 
>> >> >> >> >> invalidate_kernel_vmap_range(void *vaddr, int size)
>> >> >> >> >>
>> >> >> >> >>  #ifdef CONFIG_HIGHMEM
>> >> >> >> >>  #include 
>> >> >> >> >> +#define PKMAP_INVALID_INDEX (LAST_PKMAP)
>> >> >> >> >>
>> >> >> >> >>  /* declarations for linux/mm/highmem.c */
>> >> >> >> >>  unsigned int nr_free_highpages(void);
>> >> >> >> >> diff --git a/mm/highmem.c b/mm/highmem.c
>> >> >> >> >> index d98b0a9..b365f7b 100644
>> >> >> >> >> --- a/mm/highmem.c
>> >> >> >> >> +++ b/mm/highmem.c
>> >> >> >> >> @@ -106,10 +106,10 @@ struct page *kmap_to_page(void *vaddr)
>> >> >> >> >>   return virt_to_page(addr);
>> >> >> >> >>  }
>> >> >> >> >>
>> >> >> >> >> -static void flush_all_zero_pkmaps(void)
>> >> >> >> >> +static unsigned int flush_all_zero_pkmaps(void)
>> >> >> >> >>  {
>> >> >> >> >>   int i;
>> >> >> >> >> - int need_flush = 0;
>> >> >> >> >> + unsigned int index = PKMAP_INVALID_INDEX;
>> >> >> >> >>
>> >> >> >> >>   flush_cache_kmaps();
>> >> >> >> >>
>> >> >> >> >> @@ -141,10 +141,13 @@ static void flush_all_zero_pkmaps(void)
>> >> >> >> >> &pkmap_page_table[i]);
>> >> >> >> >>
>> >> >> >> >>   set_page_address(page, NULL);
>> >> >> >> >>

[PATCH v2 0/3] introduce static_vm for ARM-specific static mapped area

2012-11-27 Thread Joonsoo Kim

In current implementation, we used ARM-specific flag, that is,
VM_ARM_STATIC_MAPPING, for distinguishing ARM specific static mapped area.
The purpose of static mapped area is to re-use static mapped area when
entire physical address range of the ioremap request can be covered
by this area.

This implementation causes needless overhead for some cases.
For example, assume that there is only one static mapped area and
vmlist has 300 areas. Every time we call ioremap, we check 300 areas for
deciding whether it is matched or not. Moreover, even if there is
no static mapped area and vmlist has 300 areas, every time we call
ioremap, we check 300 areas in now.

If we construct a extra list for static mapped area, we can eliminate
above mentioned overhead.
With a extra list, if there is one static mapped area,
we just check only one area and proceed next operation quickly.

In fact, it is not a critical problem, because ioremap is not frequently
used. But reducing overhead is better idea.

Another reason for doing this work is for removing architecture dependency
on vmalloc layer. I think that vmlist and vmlist_lock is internal data
structure for vmalloc layer. Some codes for debugging and stat inevitably
use vmlist and vmlist_lock. But it is preferable that they are used
as least as possible in outside of vmalloc.c

Changelog
v1->v2: 
  [2/3]: patch description is improved.
  Rebased on v3.7-rc7

Joonsoo Kim (3):
  ARM: vmregion: remove vmregion code entirely
  ARM: static_vm: introduce an infrastructure for static mapped area
  ARM: mm: use static_vm for managing static mapped areas

 arch/arm/include/asm/mach/static_vm.h |   51 
 arch/arm/mm/Makefile  |2 +-
 arch/arm/mm/ioremap.c |   69 ---
 arch/arm/mm/mm.h  |   10 --
 arch/arm/mm/mmu.c |   55 +
 arch/arm/mm/static_vm.c   |   97 
 arch/arm/mm/vmregion.c|  205 -
 arch/arm/mm/vmregion.h|   31 -
 8 files changed, 208 insertions(+), 312 deletions(-)
 create mode 100644 arch/arm/include/asm/mach/static_vm.h
 create mode 100644 arch/arm/mm/static_vm.c
 delete mode 100644 arch/arm/mm/vmregion.c
 delete mode 100644 arch/arm/mm/vmregion.h

-- 
1.7.9.5

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

[PATCH v2 1/3] ARM: vmregion: remove vmregion code entirely

2012-11-27 Thread Joonsoo Kim

Now, there is no user for vmregion.
So remove it.

Signed-off-by: Joonsoo Kim 

diff --git a/arch/arm/mm/Makefile b/arch/arm/mm/Makefile
index 8a9c4cb..4e333fa 100644
--- a/arch/arm/mm/Makefile
+++ b/arch/arm/mm/Makefile
@@ -6,7 +6,7 @@ obj-y   := dma-mapping.o extable.o 
fault.o init.o \
   iomap.o
 
 obj-$(CONFIG_MMU)  += fault-armv.o flush.o idmap.o ioremap.o \
-  mmap.o pgd.o mmu.o vmregion.o
+  mmap.o pgd.o mmu.o
 
 ifneq ($(CONFIG_MMU),y)
 obj-y  += nommu.o
diff --git a/arch/arm/mm/vmregion.c b/arch/arm/mm/vmregion.c
deleted file mode 100644
index a631016..000
--- a/arch/arm/mm/vmregion.c
+++ /dev/null
@@ -1,205 +0,0 @@
-#include 
-#include 
-#include 
-#include 
-#include 
-#include 
-
-#include "vmregion.h"
-
-/*
- * VM region handling support.
- *
- * This should become something generic, handling VM region allocations for
- * vmalloc and similar (ioremap, module space, etc).
- *
- * I envisage vmalloc()'s supporting vm_struct becoming:
- *
- *  struct vm_struct {
- *struct vmregion  region;
- *unsigned longflags;
- *struct page  **pages;
- *unsigned int nr_pages;
- *unsigned longphys_addr;
- *  };
- *
- * get_vm_area() would then call vmregion_alloc with an appropriate
- * struct vmregion head (eg):
- *
- *  struct vmregion vmalloc_head = {
- * .vm_list= LIST_HEAD_INIT(vmalloc_head.vm_list),
- * .vm_start   = VMALLOC_START,
- * .vm_end = VMALLOC_END,
- *  };
- *
- * However, vmalloc_head.vm_start is variable (typically, it is dependent on
- * the amount of RAM found at boot time.)  I would imagine that get_vm_area()
- * would have to initialise this each time prior to calling vmregion_alloc().
- */
-
-struct arm_vmregion *
-arm_vmregion_alloc(struct arm_vmregion_head *head, size_t align,
-  size_t size, gfp_t gfp, const void *caller)
-{
-   unsigned long start = head->vm_start, addr = head->vm_end;
-   unsigned long flags;
-   struct arm_vmregion *c, *new;
-
-   if (head->vm_end - head->vm_start < size) {
-   printk(KERN_WARNING "%s: allocation too big (requested %#x)\n",
-   __func__, size);
-   goto out;
-   }
-
-   new = kmalloc(sizeof(struct arm_vmregion), gfp);
-   if (!new)
-   goto out;
-
-   new->caller = caller;
-
-   spin_lock_irqsave(&head->vm_lock, flags);
-
-   addr = rounddown(addr - size, align);
-   list_for_each_entry_reverse(c, &head->vm_list, vm_list) {
-   if (addr >= c->vm_end)
-   goto found;
-   addr = rounddown(c->vm_start - size, align);
-   if (addr < start)
-   goto nospc;
-   }
-
- found:
-   /*
-* Insert this entry after the one we found.
-*/
-   list_add(&new->vm_list, &c->vm_list);
-   new->vm_start = addr;
-   new->vm_end = addr + size;
-   new->vm_active = 1;
-
-   spin_unlock_irqrestore(&head->vm_lock, flags);
-   return new;
-
- nospc:
-   spin_unlock_irqrestore(&head->vm_lock, flags);
-   kfree(new);
- out:
-   return NULL;
-}
-
-static struct arm_vmregion *__arm_vmregion_find(struct arm_vmregion_head 
*head, unsigned long addr)
-{
-   struct arm_vmregion *c;
-
-   list_for_each_entry(c, &head->vm_list, vm_list) {
-   if (c->vm_active && c->vm_start == addr)
-   goto out;
-   }
-   c = NULL;
- out:
-   return c;
-}
-
-struct arm_vmregion *arm_vmregion_find(struct arm_vmregion_head *head, 
unsigned long addr)
-{
-   struct arm_vmregion *c;
-   unsigned long flags;
-
-   spin_lock_irqsave(&head->vm_lock, flags);
-   c = __arm_vmregion_find(head, addr);
-   spin_unlock_irqrestore(&head->vm_lock, flags);
-   return c;
-}
-
-struct arm_vmregion *arm_vmregion_find_remove(struct arm_vmregion_head *head, 
unsigned long addr)
-{
-   struct arm_vmregion *c;
-   unsigned long flags;
-
-   spin_lock_irqsave(&head->vm_lock, flags);
-   c = __arm_vmregion_find(head, addr);
-   if (c)
-   c->vm_active = 0;
-   spin_unlock_irqrestore(&head->vm_lock, flags);
-   return c;
-}
-
-void arm_vmregion_free(struct arm_vmregion_head *head, struct arm_vmregion *c)
-{
-   unsigned long flags;
-
-   spin_lock_irqsave(&head->vm_lock, flags);
-   list_del(&c->vm_list);
-   spin_unlock_irqrestore(&head->vm_lock, flags);
-
-   kfree(c);
-}
-
-#ifdef CONFIG_PROC_FS
-static int arm_vmregion_show(struct seq_file *m, void *p)
-{
-   struct arm_vmregion *c = list_entry(p, struct arm_vmregion, vm_list);
-
-

[PATCH v2 2/3] ARM: static_vm: introduce an infrastructure for static mapped area

2012-11-27 Thread Joonsoo Kim

In current implementation, we used ARM-specific flag, that is,
VM_ARM_STATIC_MAPPING, for distinguishing ARM specific static mapped area.
The purpose of static mapped area is to re-use static mapped area when
entire physical address range of the ioremap request can be covered
by this area.

This implementation causes needless overhead for some cases.
For example, assume that there is only one static mapped area and
vmlist has 300 areas. Every time we call ioremap, we check 300 areas for
deciding whether it is matched or not. Moreover, even if there is
no static mapped area and vmlist has 300 areas, every time we call
ioremap, we check 300 areas in now.

If we construct a extra list for static mapped area, we can eliminate
above mentioned overhead.
With a extra list, if there is one static mapped area,
we just check only one area and proceed next operation quickly.

In fact, it is not a critical problem, because ioremap is not frequently
used. But reducing overhead is better idea.

Another reason for doing this work is for removing architecture dependency
on vmalloc layer. I think that vmlist and vmlist_lock is internal data
structure for vmalloc layer. Some codes for debugging and stat inevitably
use vmlist and vmlist_lock. But it is preferable that they are used
as least as possible in outside of vmalloc.c

Now, I introduce an ARM-specific infrastructure for static mapped area. In
the following patch, we will use this and resolve above mentioned problem.

Signed-off-by: Joonsoo Kim 

diff --git a/arch/arm/include/asm/mach/static_vm.h 
b/arch/arm/include/asm/mach/static_vm.h
new file mode 100644
index 000..1bb6604
--- /dev/null
+++ b/arch/arm/include/asm/mach/static_vm.h
@@ -0,0 +1,45 @@
+/*
+ * arch/arm/include/asm/mach/static_vm.h
+ *
+ * Copyright (C) 2012 LG Electronics, Joonsoo Kim 
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
+ */
+
+#ifndef _ASM_MACH_STATIC_VM_H
+#define _ASM_MACH_STATIC_VM_H
+
+#include 
+#include 
+
+struct static_vm {
+   struct static_vm*next;
+   void*vaddr;
+   unsigned long   size;
+   unsigned long   flags;
+   phys_addr_t paddr;
+   const void  *caller;
+};
+
+extern struct static_vm *static_vmlist;
+extern spinlock_t static_vmlist_lock;
+
+extern struct static_vm *find_static_vm_paddr(phys_addr_t paddr,
+   size_t size, unsigned long flags);
+extern struct static_vm *find_static_vm_vaddr(void *vaddr, unsigned long 
flags);
+extern void init_static_vm(struct static_vm *static_vm,
+   struct vm_struct *vm, unsigned long flags);
+extern void insert_static_vm(struct static_vm *vm);
+
+#endif /* _ASM_MACH_STATIC_VM_H */
diff --git a/arch/arm/mm/Makefile b/arch/arm/mm/Makefile
index 4e333fa..57b329a 100644
--- a/arch/arm/mm/Makefile
+++ b/arch/arm/mm/Makefile
@@ -6,7 +6,7 @@ obj-y   := dma-mapping.o extable.o 
fault.o init.o \
   iomap.o
 
 obj-$(CONFIG_MMU)  += fault-armv.o flush.o idmap.o ioremap.o \
-  mmap.o pgd.o mmu.o
+  mmap.o pgd.o mmu.o static_vm.o
 
 ifneq ($(CONFIG_MMU),y)
 obj-y  += nommu.o
diff --git a/arch/arm/mm/static_vm.c b/arch/arm/mm/static_vm.c
new file mode 100644
index 000..d7677cf
--- /dev/null
+++ b/arch/arm/mm/static_vm.c
@@ -0,0 +1,97 @@
+/*
+ * arch/arm/mm/static_vm.c
+ *
+ * Copyright (C) 2012 LG Electronics, Joonsoo Kim 
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
+ */
+
+#include 
+
+#include 
+
+struct static_vm *static_vmlist;
+DEFINE_SPINLOCK(static_vmlist_lock);
+
+struct static_vm *find_static_vm_paddr(phys_addr_t paddr

[PATCH v2 3/3] ARM: mm: use static_vm for managing static mapped areas

2012-11-27 Thread Joonsoo Kim

A static mapped area is ARM-specific, so it is better not to use
generic vmalloc data structure, that is, vmlist and vmlist_lock
for managing static mapped area. And it causes some needless overhead and
reducing this overhead is better idea.

Now, we have newly introduced static_vm infrastructure.
With it, we don't need to iterate all mapped areas. Instead, we just
iterate static mapped areas. It helps to reduce an overhead of finding
matched area. And architecture dependency on vmalloc layer is removed,
so it will help to maintainability for vmalloc layer.

Signed-off-by: Joonsoo Kim 

diff --git a/arch/arm/include/asm/mach/static_vm.h 
b/arch/arm/include/asm/mach/static_vm.h
index 1bb6604..0d9c685 100644
--- a/arch/arm/include/asm/mach/static_vm.h
+++ b/arch/arm/include/asm/mach/static_vm.h
@@ -32,6 +32,12 @@ struct static_vm {
const void  *caller;
 };
 
+#define STATIC_VM_MEM  0x0001
+#define STATIC_VM_EMPTY0x0002
+
+/* mtype should be less than 28 */
+#define STATIC_VM_MTYPE(mt)(1UL << ((mt) + 4))
+
 extern struct static_vm *static_vmlist;
 extern spinlock_t static_vmlist_lock;
 
diff --git a/arch/arm/mm/ioremap.c b/arch/arm/mm/ioremap.c
index 5dcc2fd..b7f3c27 100644
--- a/arch/arm/mm/ioremap.c
+++ b/arch/arm/mm/ioremap.c
@@ -36,6 +36,7 @@
 #include 
 
 #include 
+#include 
 #include 
 #include "mm.h"
 
@@ -197,7 +198,8 @@ void __iomem * __arm_ioremap_pfn_caller(unsigned long pfn,
const struct mem_type *type;
int err;
unsigned long addr;
-   struct vm_struct * area;
+   struct vm_struct *area;
+   phys_addr_t paddr = __pfn_to_phys(pfn);
 
 #ifndef CONFIG_ARM_LPAE
/*
@@ -219,24 +221,17 @@ void __iomem * __arm_ioremap_pfn_caller(unsigned long pfn,
/*
 * Try to reuse one of the static mapping whenever possible.
 */
-   read_lock(&vmlist_lock);
-   for (area = vmlist; area; area = area->next) {
-   if (!size || (sizeof(phys_addr_t) == 4 && pfn >= 0x10))
-   break;
-   if (!(area->flags & VM_ARM_STATIC_MAPPING))
-   continue;
-   if ((area->flags & VM_ARM_MTYPE_MASK) != VM_ARM_MTYPE(mtype))
-   continue;
-   if (__phys_to_pfn(area->phys_addr) > pfn ||
-   __pfn_to_phys(pfn) + size-1 > area->phys_addr + 
area->size-1)
-   continue;
-   /* we can drop the lock here as we know *area is static */
-   read_unlock(&vmlist_lock);
-   addr = (unsigned long)area->addr;
-   addr += __pfn_to_phys(pfn) - area->phys_addr;
-   return (void __iomem *) (offset + addr);
+   if (size && !((sizeof(phys_addr_t) == 4 && pfn >= 0x10))) {
+   struct static_vm *static_vm;
+
+   static_vm = find_static_vm_paddr(__pfn_to_phys(pfn), size,
+   STATIC_VM_MEM | STATIC_VM_MTYPE(mtype));
+   if (static_vm) {
+   addr = (unsigned long)static_vm->vaddr;
+   addr += paddr - static_vm->paddr;
+   return (void __iomem *) (offset + addr);
+   }
}
-   read_unlock(&vmlist_lock);
 
/*
 * Don't allow RAM to be mapped - this causes problems with ARMv6+
@@ -248,7 +243,7 @@ void __iomem * __arm_ioremap_pfn_caller(unsigned long pfn,
if (!area)
return NULL;
addr = (unsigned long)area->addr;
-   area->phys_addr = __pfn_to_phys(pfn);
+   area->phys_addr = paddr;
 
 #if !defined(CONFIG_SMP) && !defined(CONFIG_ARM_LPAE)
if (DOMAIN_IO == 0 &&
@@ -346,34 +341,20 @@ __arm_ioremap_exec(unsigned long phys_addr, size_t size, 
bool cached)
 void __iounmap(volatile void __iomem *io_addr)
 {
void *addr = (void *)(PAGE_MASK & (unsigned long)io_addr);
-   struct vm_struct *vm;
-
-   read_lock(&vmlist_lock);
-   for (vm = vmlist; vm; vm = vm->next) {
-   if (vm->addr > addr)
-   break;
-   if (!(vm->flags & VM_IOREMAP))
-   continue;
-   /* If this is a static mapping we must leave it alone */
-   if ((vm->flags & VM_ARM_STATIC_MAPPING) &&
-   (vm->addr <= addr) && (vm->addr + vm->size > addr)) {
-   read_unlock(&vmlist_lock);
-   return;
-   }
+   struct static_vm *static_vm;
+
+   static_vm = find_static_vm_vaddr(addr, STATIC_VM_MEM);
+   if (static_vm)
+   return;
+
 #if !defined(CONFIG_SMP) && !defined(CONFIG_ARM_LPAE)
-   /*
-* If this is a section based mapping we need to h

[PATCH v2] ARM: mm: clean-up in order to reduce to call kmap_high_get()

2013-03-18 Thread Joonsoo Kim

In kmap_atomic(), kmap_high_get() is invoked for checking already
mapped area. In __flush_dcache_page() and dma_cache_maint_page(),
we explicitly call kmap_high_get() before kmap_atomic()
when cache_is_vipt(), so kmap_high_get() can be invoked twice.
This is useless operation, so remove one.

v2: change cache_is_vipt() to cache_is_vipt_nonaliasing() in order to
be self-documented

Acked-by: Nicolas Pitre 
Signed-off-by: Joonsoo Kim 
---
Hello Nicolas.
I maintain your 'Acked-by' while updating this patch to v2.
Please let me know if there is problem.

Thanks.

diff --git a/arch/arm/mm/dma-mapping.c b/arch/arm/mm/dma-mapping.c
index e9db6b4..ef3e0f3 100644
--- a/arch/arm/mm/dma-mapping.c
+++ b/arch/arm/mm/dma-mapping.c
@@ -823,16 +823,17 @@ static void dma_cache_maint_page(struct page *page, 
unsigned long offset,
if (PageHighMem(page)) {
if (len + offset > PAGE_SIZE)
len = PAGE_SIZE - offset;
-   vaddr = kmap_high_get(page);
-   if (vaddr) {
-   vaddr += offset;
-   op(vaddr, len, dir);
-   kunmap_high(page);
-   } else if (cache_is_vipt()) {
-   /* unmapped pages might still be cached */
+
+   if (cache_is_vipt_nonaliasing()) {
vaddr = kmap_atomic(page);
op(vaddr + offset, len, dir);
kunmap_atomic(vaddr);
+   } else {
+   vaddr = kmap_high_get(page);
+   if (vaddr) {
+   op(vaddr + offset, len, dir);
+   kunmap_high(page);
+   }
}
} else {
vaddr = page_address(page) + offset;
diff --git a/arch/arm/mm/flush.c b/arch/arm/mm/flush.c
index 1c8f7f5..0d473cc 100644
--- a/arch/arm/mm/flush.c
+++ b/arch/arm/mm/flush.c
@@ -170,15 +170,18 @@ void __flush_dcache_page(struct address_space *mapping, 
struct page *page)
if (!PageHighMem(page)) {
__cpuc_flush_dcache_area(page_address(page), PAGE_SIZE);
} else {
-   void *addr = kmap_high_get(page);
-   if (addr) {
-   __cpuc_flush_dcache_area(addr, PAGE_SIZE);
-   kunmap_high(page);
-   } else if (cache_is_vipt()) {
-   /* unmapped pages might still be cached */
+   void *addr;
+
+   if (cache_is_vipt_nonaliasing()) {
addr = kmap_atomic(page);
__cpuc_flush_dcache_area(addr, PAGE_SIZE);
kunmap_atomic(addr);
+   } else {
+   addr = kmap_high_get(page);
+   if (addr) {
+   __cpuc_flush_dcache_area(addr, PAGE_SIZE);
+   kunmap_high(page);
+   }
}
}
 
-- 
1.7.9.5

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [RFC PATCH] ARM: mm: disable kmap_high_get() for SMP

2013-03-18 Thread Joonsoo Kim

On Thu, Mar 07, 2013 at 07:35:51PM +0900, JoonSoo Kim wrote:
> 2013/3/7 Nicolas Pitre :
> > On Thu, 7 Mar 2013, Joonsoo Kim wrote:
> >
> >> Hello, Nicolas.
> >>
> >> On Tue, Mar 05, 2013 at 05:36:12PM +0800, Nicolas Pitre wrote:
> >> > On Mon, 4 Mar 2013, Joonsoo Kim wrote:
> >> >
> >> > > With SMP and enabling kmap_high_get(), it makes users of kmap_atomic()
> >> > > sequential ordered, because kmap_high_get() use global kmap_lock().
> >> > > It is not welcome situation, so turn off this optimization for SMP.
> >> >
> >> > I'm not sure I understand the problem.
> >> >
> >> > The lock taken by kmap_high_get() is released right away before that
> >> > function returns and therefore this is not actually serializing
> >> > anything.
> >>
> >> Yes, you understand what I want to say correctly.
> >> Sorry for bad explanation.
> >>
> >> Following is reasons why I send this patch with RFC tag.
> >>
> >> If we have more cpus, performance degration is possible although
> >> it is very short time to holding the lock in kmap_high_get().
> >>
> >> And kmap has maximum 512 entries(512 * 4K = 2M) and some mobile devices
> >> has 2G memory(highmem 1G>), so probability for finding matched entry
> >> is approximately < 1/512. This probability can be more decreasing
> >> for device which have more memory. So I think that waste time to find
> >> matched entry is more than saved time.
> >>
> >> Above is my humble opinion, so please let me know what I am missing.
> >
> > Please look at the kmap_high_get() code again.  It performs no
> > searching at all.  What it does is:
> 
> If page is not highmem, it may be already filtered in kmap_atomic().
> So we only consider highmem page.
> 
> For highmem page, it perform searching.
> In kmap_high_get(), page_address() is called.
> In page_address(), it hash PA and iterate a list for this hashed value.
> 
> And another advantage of disabling ARCH_NEEDS_KMAP_HIGH_GET is
> that kmap(), kunmap() works without irq disabled.
> 
> Thanks.

Hello, Nicolas.

For just confirm, you don't agree with this, right?

Thanks.

> 
> > - lock the kmap array against concurrent changes
> >
> > - if the given page is not highmem, unlock and return NULL
> >
> > - otherwise increment that page reference count, unlock, and return the
> >   mapped address for that page.
> >
> > There is almost zero cost to this function, independently of the number
> > of kmap entries, whereas it does save much bigger costs elsewhere when
> > it is successful.
> >
> >
> > Nicolas
> > --
> > To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> > the body of a message to majord...@vger.kernel.org
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> > Please read the FAQ at  http://www.tux.org/lkml/
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH v2 1/3] slub: correct to calculate num of acquired objects in get_partial_node()

2013-03-18 Thread Joonsoo Kim

Hello, Pekka.

Could you pick up 1/3, 3/3?
These are already acked by Christoph.
2/3 is same effect as Glauber's "slub: correctly bootstrap boot caches",
so should skip it.

Thanks.

On Mon, Jan 21, 2013 at 05:01:25PM +0900, Joonsoo Kim wrote:
> There is a subtle bug when calculating a number of acquired objects.
> 
> Currently, we calculate "available = page->objects - page->inuse",
> after acquire_slab() is called in get_partial_node().
> 
> In acquire_slab() with mode = 1, we always set new.inuse = page->objects.
> So,
> 
>   acquire_slab(s, n, page, object == NULL);
> 
>   if (!object) {
>   c->page = page;
>   stat(s, ALLOC_FROM_PARTIAL);
>   object = t;
>   available = page->objects - page->inuse;
> 
>   !!! availabe is always 0 !!!
>   ...
> 
> Therfore, "available > s->cpu_partial / 2" is always false and
> we always go to second iteration.
> This patch correct this problem.
> 
> After that, we don't need return value of put_cpu_partial().
> So remove it.
> 
> v2: calculate nr of objects using new.objects and new.inuse.
> It is more accurate way than before.
> 
> Signed-off-by: Joonsoo Kim 
> 
> diff --git a/mm/slub.c b/mm/slub.c
> index ba2ca53..7204c74 100644
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -1493,7 +1493,7 @@ static inline void remove_partial(struct 
> kmem_cache_node *n,
>   */
>  static inline void *acquire_slab(struct kmem_cache *s,
>   struct kmem_cache_node *n, struct page *page,
> - int mode)
> + int mode, int *objects)
>  {
>   void *freelist;
>   unsigned long counters;
> @@ -1507,6 +1507,7 @@ static inline void *acquire_slab(struct kmem_cache *s,
>   freelist = page->freelist;
>   counters = page->counters;
>   new.counters = counters;
> + *objects = new.objects - new.inuse;
>   if (mode) {
>   new.inuse = page->objects;
>   new.freelist = NULL;
> @@ -1528,7 +1529,7 @@ static inline void *acquire_slab(struct kmem_cache *s,
>   return freelist;
>  }
>  
> -static int put_cpu_partial(struct kmem_cache *s, struct page *page, int 
> drain);
> +static void put_cpu_partial(struct kmem_cache *s, struct page *page, int 
> drain);
>  static inline bool pfmemalloc_match(struct page *page, gfp_t gfpflags);
>  
>  /*
> @@ -1539,6 +1540,8 @@ static void *get_partial_node(struct kmem_cache *s, 
> struct kmem_cache_node *n,
>  {
>   struct page *page, *page2;
>   void *object = NULL;
> + int available = 0;
> + int objects;
>  
>   /*
>* Racy check. If we mistakenly see no partial slabs then we
> @@ -1552,22 +1555,21 @@ static void *get_partial_node(struct kmem_cache *s, 
> struct kmem_cache_node *n,
>   spin_lock(&n->list_lock);
>   list_for_each_entry_safe(page, page2, &n->partial, lru) {
>   void *t;
> - int available;
>  
>   if (!pfmemalloc_match(page, flags))
>   continue;
>  
> - t = acquire_slab(s, n, page, object == NULL);
> + t = acquire_slab(s, n, page, object == NULL, &objects);
>   if (!t)
>   break;
>  
> + available += objects;
>   if (!object) {
>   c->page = page;
>   stat(s, ALLOC_FROM_PARTIAL);
>   object = t;
> - available =  page->objects - page->inuse;
>   } else {
> - available = put_cpu_partial(s, page, 0);
> + put_cpu_partial(s, page, 0);
>   stat(s, CPU_PARTIAL_NODE);
>   }
>   if (kmem_cache_debug(s) || available > s->cpu_partial / 2)
> @@ -1946,7 +1948,7 @@ static void unfreeze_partials(struct kmem_cache *s,
>   * If we did not find a slot then simply move all the partials to the
>   * per node partial list.
>   */
> -static int put_cpu_partial(struct kmem_cache *s, struct page *page, int 
> drain)
> +static void put_cpu_partial(struct kmem_cache *s, struct page *page, int 
> drain)
>  {
>   struct page *oldpage;
>   int pages;
> @@ -1984,7 +1986,6 @@ static int put_cpu_partial(struct kmem_cache *s, struct 
> page *page, int drain)
>   page->next = oldpage;
>  
>   } while (this_cpu_cmpxchg(s->cpu_slab->partial, oldpage, page) != 
> oldpage);
> - return pobjects;
>  }
>  
>  static inline void flush_slab(struct kmem_cache *s, struct kmem_cache_cpu *c)
> -- 
> 1.7.9.5
> 
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majord...@kvack.org.  For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: mailto:"d...@kvack.org";> em...@kvack.org 
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH 0/8] correct load_balance()

2013-03-18 Thread Joonsoo Kim

On Mon, Feb 25, 2013 at 01:56:59PM +0900, Joonsoo Kim wrote:
> On Thu, Feb 14, 2013 at 02:48:33PM +0900, Joonsoo Kim wrote:
> > Commit 88b8dac0 makes load_balance() consider other cpus in its group.
> > But, there are some missing parts for this feature to work properly.
> > This patchset correct these things and make load_balance() robust.
> > 
> > Others are related to LBF_ALL_PINNED. This is fallback functionality
> > when all tasks can't be moved as cpu affinity. But, currently,
> > if imbalance is not large enough to task's load, we leave LBF_ALL_PINNED
> > flag and 'redo' is triggered. This is not our intention, so correct it.
> > 
> > These are based on v3.8-rc7.
> > 
> > Joonsoo Kim (8):
> >   sched: change position of resched_cpu() in load_balance()
> >   sched: explicitly cpu_idle_type checking in rebalance_domains()
> >   sched: don't consider other cpus in our group in case of NEWLY_IDLE
> >   sched: clean up move_task() and move_one_task()
> >   sched: move up affinity check to mitigate useless redoing overhead
> >   sched: rename load_balance_tmpmask to load_balance_cpu_active
> >   sched: prevent to re-select dst-cpu in load_balance()
> >   sched: reset lb_env when redo in load_balance()
> > 
> >  kernel/sched/core.c |9 +++--
> >  kernel/sched/fair.c |  107 
> > +--
> >  2 files changed, 67 insertions(+), 49 deletions(-)
> 
> Hello, Ingo and Peter.
> 
> Could you review this patch set?
> Please let me know what I should do for merging this?

Hello.
One more ping :)

> 
> Thanks.
> 
> > -- 
> > 1.7.9.5
> > 
> > --
> > To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> > the body of a message to majord...@vger.kernel.org
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> > Please read the FAQ at  http://www.tux.org/lkml/
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

[PATCH 2/3] mm, nobootmem: clean-up of free_low_memory_core_early()

2013-03-18 Thread Joonsoo Kim

Remove unused argument and make function static,
because there is no user outside of nobootmem.c

Signed-off-by: Joonsoo Kim 

diff --git a/include/linux/bootmem.h b/include/linux/bootmem.h
index cdc3bab..5f0b0e1 100644
--- a/include/linux/bootmem.h
+++ b/include/linux/bootmem.h
@@ -44,7 +44,6 @@ extern unsigned long init_bootmem_node(pg_data_t *pgdat,
   unsigned long endpfn);
 extern unsigned long init_bootmem(unsigned long addr, unsigned long memend);
 
-extern unsigned long free_low_memory_core_early(int nodeid);
 extern unsigned long free_all_bootmem_node(pg_data_t *pgdat);
 extern unsigned long free_all_bootmem(void);
 
diff --git a/mm/nobootmem.c b/mm/nobootmem.c
index 4711e91..589c673 100644
--- a/mm/nobootmem.c
+++ b/mm/nobootmem.c
@@ -120,7 +120,7 @@ static unsigned long __init __free_memory_core(phys_addr_t 
start,
return end_pfn - start_pfn;
 }
 
-unsigned long __init free_low_memory_core_early(int nodeid)
+static unsigned long __init free_low_memory_core_early()
 {
unsigned long count = 0;
phys_addr_t start, end, size;
@@ -170,7 +170,7 @@ unsigned long __init free_all_bootmem(void)
 *  because in some case like Node0 doesn't have RAM installed
 *  low ram will be on Node1
 */
-   return free_low_memory_core_early(MAX_NUMNODES);
+   return free_low_memory_core_early();
 }
 
 /**
-- 
1.7.9.5

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

[PATCH 1/3] mm, nobootmem: fix wrong usage of max_low_pfn

2013-03-18 Thread Joonsoo Kim

max_low_pfn reflect the number of _pages_ in the system,
not the maximum PFN. You can easily find that fact in init_bootmem().
So fix it.

Additionally, if 'start_pfn == end_pfn', we don't need to go futher,
so change range check.

Signed-off-by: Joonsoo Kim 

diff --git a/mm/nobootmem.c b/mm/nobootmem.c
index 5e07d36..4711e91 100644
--- a/mm/nobootmem.c
+++ b/mm/nobootmem.c
@@ -110,9 +110,9 @@ static unsigned long __init __free_memory_core(phys_addr_t 
start,
 {
unsigned long start_pfn = PFN_UP(start);
unsigned long end_pfn = min_t(unsigned long,
- PFN_DOWN(end), max_low_pfn);
+ PFN_DOWN(end), min_low_pfn);
 
-   if (start_pfn > end_pfn)
+   if (start_pfn >= end_pfn)
return 0;
 
__free_pages_memory(start_pfn, end_pfn);
-- 
1.7.9.5

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

[PATCH 3/3] mm, nobootmem: do memset() after memblock_reserve()

2013-03-18 Thread Joonsoo Kim

Currently, we do memset() before reserving the area.
This may not cause any problem, but it is somewhat weird.
So change execution order.

Signed-off-by: Joonsoo Kim 

diff --git a/mm/nobootmem.c b/mm/nobootmem.c
index 589c673..f11ec1c 100644
--- a/mm/nobootmem.c
+++ b/mm/nobootmem.c
@@ -46,8 +46,8 @@ static void * __init __alloc_memory_core_early(int nid, u64 
size, u64 align,
return NULL;
 
ptr = phys_to_virt(addr);
-   memset(ptr, 0, size);
memblock_reserve(addr, size);
+   memset(ptr, 0, size);
/*
 * The min_count is set to 0 so that bootmem allocated blocks
 * are never reported as leaks.
-- 
1.7.9.5

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH 2/3] mm, nobootmem: clean-up of free_low_memory_core_early()

2013-03-18 Thread Joonsoo Kim

On Tue, Mar 19, 2013 at 02:16:00PM +0900, Joonsoo Kim wrote:
> Remove unused argument and make function static,
> because there is no user outside of nobootmem.c
> 
> Signed-off-by: Joonsoo Kim 
> 
> diff --git a/include/linux/bootmem.h b/include/linux/bootmem.h
> index cdc3bab..5f0b0e1 100644
> --- a/include/linux/bootmem.h
> +++ b/include/linux/bootmem.h
> @@ -44,7 +44,6 @@ extern unsigned long init_bootmem_node(pg_data_t *pgdat,
>  unsigned long endpfn);
>  extern unsigned long init_bootmem(unsigned long addr, unsigned long memend);
>  
> -extern unsigned long free_low_memory_core_early(int nodeid);
>  extern unsigned long free_all_bootmem_node(pg_data_t *pgdat);
>  extern unsigned long free_all_bootmem(void);
>  
> diff --git a/mm/nobootmem.c b/mm/nobootmem.c
> index 4711e91..589c673 100644
> --- a/mm/nobootmem.c
> +++ b/mm/nobootmem.c
> @@ -120,7 +120,7 @@ static unsigned long __init 
> __free_memory_core(phys_addr_t start,
>   return end_pfn - start_pfn;
>  }
>  
> -unsigned long __init free_low_memory_core_early(int nodeid)
> +static unsigned long __init free_low_memory_core_early()
>  {
>   unsigned long count = 0;
>   phys_addr_t start, end, size;
> @@ -170,7 +170,7 @@ unsigned long __init free_all_bootmem(void)
>*  because in some case like Node0 doesn't have RAM installed
>*  low ram will be on Node1
>*/
> - return free_low_memory_core_early(MAX_NUMNODES);
> + return free_low_memory_core_early();
>  }
>  
>  /**
> -- 
> 1.7.9.5
> 
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majord...@kvack.org.  For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: mailto:"d...@kvack.org";> em...@kvack.org 

Sorry, this patch makes build warning.
Below is fixed version.

>&
>From 05f4a768dd5c514113916908f4710f8863704ed9 Mon Sep 17 00:00:00 2001
From: Joonsoo Kim 
Date: Mon, 18 Mar 2013 14:17:57 +0900
Subject: [PATCH] mm, nobootmem: clean-up of free_low_memory_core_early()

Remove unused argument and make function static,
because there is no user outside of nobootmem.c

Signed-off-by: Joonsoo Kim 

diff --git a/include/linux/bootmem.h b/include/linux/bootmem.h
index cdc3bab..5f0b0e1 100644
--- a/include/linux/bootmem.h
+++ b/include/linux/bootmem.h
@@ -44,7 +44,6 @@ extern unsigned long init_bootmem_node(pg_data_t *pgdat,
   unsigned long endpfn);
 extern unsigned long init_bootmem(unsigned long addr, unsigned long memend);
 
-extern unsigned long free_low_memory_core_early(int nodeid);
 extern unsigned long free_all_bootmem_node(pg_data_t *pgdat);
 extern unsigned long free_all_bootmem(void);
 
diff --git a/mm/nobootmem.c b/mm/nobootmem.c
index 4711e91..9c38698 100644
--- a/mm/nobootmem.c
+++ b/mm/nobootmem.c
@@ -120,7 +120,7 @@ static unsigned long __init __free_memory_core(phys_addr_t 
start,
return end_pfn - start_pfn;
 }
 
-unsigned long __init free_low_memory_core_early(int nodeid)
+static unsigned long __init free_low_memory_core_early(void)
 {
unsigned long count = 0;
phys_addr_t start, end, size;
@@ -170,7 +170,7 @@ unsigned long __init free_all_bootmem(void)
 *  because in some case like Node0 doesn't have RAM installed
 *  low ram will be on Node1
 */
-   return free_low_memory_core_early(MAX_NUMNODES);
+   return free_low_memory_core_early();
 }
 
 /**
-- 
1.7.9.5

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH 3/3] mm, nobootmem: do memset() after memblock_reserve()

2013-03-18 Thread Joonsoo Kim

On Mon, Mar 18, 2013 at 10:53:04PM -0700, Yinghai Lu wrote:
> On Mon, Mar 18, 2013 at 10:16 PM, Joonsoo Kim  wrote:
> > Currently, we do memset() before reserving the area.
> > This may not cause any problem, but it is somewhat weird.
> > So change execution order.
> >
> > Signed-off-by: Joonsoo Kim 
> >
> > diff --git a/mm/nobootmem.c b/mm/nobootmem.c
> > index 589c673..f11ec1c 100644
> > --- a/mm/nobootmem.c
> > +++ b/mm/nobootmem.c
> > @@ -46,8 +46,8 @@ static void * __init __alloc_memory_core_early(int nid, 
> > u64 size, u64 align,
> > return NULL;
> >
> > ptr = phys_to_virt(addr);
> > -   memset(ptr, 0, size);
> > memblock_reserve(addr, size);
> > +   memset(ptr, 0, size);
> 
> move down ptr = ... too ?
Okay.
I will send v2 soon.

> 
> > /*
> >  * The min_count is set to 0 so that bootmem allocated blocks
> >  * are never reported as leaks.
> > --
> > 1.7.9.5
> >
> 
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majord...@kvack.org.  For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: mailto:"d...@kvack.org";> em...@kvack.org 
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH 2/3] mm, nobootmem: clean-up of free_low_memory_core_early()

2013-03-18 Thread Joonsoo Kim

On Mon, Mar 18, 2013 at 10:51:43PM -0700, Yinghai Lu wrote:
> On Mon, Mar 18, 2013 at 10:16 PM, Joonsoo Kim  wrote:
> > Remove unused argument and make function static,
> > because there is no user outside of nobootmem.c
> >
> > Signed-off-by: Joonsoo Kim 
> >
> > diff --git a/include/linux/bootmem.h b/include/linux/bootmem.h
> > index cdc3bab..5f0b0e1 100644
> > --- a/include/linux/bootmem.h
> > +++ b/include/linux/bootmem.h
> > @@ -44,7 +44,6 @@ extern unsigned long init_bootmem_node(pg_data_t *pgdat,
> >unsigned long endpfn);
> >  extern unsigned long init_bootmem(unsigned long addr, unsigned long 
> > memend);
> >
> > -extern unsigned long free_low_memory_core_early(int nodeid);
> >  extern unsigned long free_all_bootmem_node(pg_data_t *pgdat);
> >  extern unsigned long free_all_bootmem(void);
> >
> > diff --git a/mm/nobootmem.c b/mm/nobootmem.c
> > index 4711e91..589c673 100644
> > --- a/mm/nobootmem.c
> > +++ b/mm/nobootmem.c
> > @@ -120,7 +120,7 @@ static unsigned long __init 
> > __free_memory_core(phys_addr_t start,
> > return end_pfn - start_pfn;
> >  }
> >
> > -unsigned long __init free_low_memory_core_early(int nodeid)
> > +static unsigned long __init free_low_memory_core_early()
> 
> (void) ?

Yes, fixed version is already sent.
Thanks.

> 
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majord...@kvack.org.  For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: mailto:"d...@kvack.org";> em...@kvack.org 
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH 1/3] mm, nobootmem: fix wrong usage of max_low_pfn

2013-03-18 Thread Joonsoo Kim

On Mon, Mar 18, 2013 at 10:47:41PM -0700, Yinghai Lu wrote:
> On Mon, Mar 18, 2013 at 10:15 PM, Joonsoo Kim  wrote:
> > max_low_pfn reflect the number of _pages_ in the system,
> > not the maximum PFN. You can easily find that fact in init_bootmem().
> > So fix it.
> 
> I'm confused. for x86, we have max_low_pfn defined in ...

Below is queote from Russell King in 'https://lkml.org/lkml/2013/3/13/123'


 Now, max_low_pfn is initialized this way:
 
 /**
  * init_bootmem - register boot memory
  * @start: pfn where the bitmap is to be placed
  * @pages: number of available physical pages
  *
  * Returns the number of bytes needed to hold the bitmap.
  */
 unsigned long __init init_bootmem(unsigned long start, unsigned long pages)
 {
max_low_pfn = pages;
min_low_pfn = start;
return init_bootmem_core(NODE_DATA(0)->bdata, start, 0, pages);
 }
 So, min_low_pfn is the PFN offset of the start of physical memory (so
 3GB >> PAGE_SHIFT) and max_low_pfn ends up being the number of pages,
 _not_ the maximum PFN value

So, if physical address doesn't start at 0, max_low_pfn doesn't represent 
the maximum PFN value. This is a case for ARM.

> 
> #ifdef CONFIG_X86_32
> /* max_low_pfn get updated here */
> find_low_pfn_range();
> #else
> num_physpages = max_pfn;
> 
> check_x2apic();
> 
> /* How many end-of-memory variables you have, grandma! */
> /* need this before calling reserve_initrd */
> if (max_pfn > (1UL<<(32 - PAGE_SHIFT)))
> max_low_pfn = e820_end_of_low_ram_pfn();
> else
> max_low_pfn = max_pfn;
> 
> and under max_low_pfn is bootmem.
> 
> >
> > Additionally, if 'start_pfn == end_pfn', we don't need to go futher,
> > so change range check.
> >
> > Signed-off-by: Joonsoo Kim 
> >
> > diff --git a/mm/nobootmem.c b/mm/nobootmem.c
> > index 5e07d36..4711e91 100644
> > --- a/mm/nobootmem.c
> > +++ b/mm/nobootmem.c
> > @@ -110,9 +110,9 @@ static unsigned long __init 
> > __free_memory_core(phys_addr_t start,
> >  {
> > unsigned long start_pfn = PFN_UP(start);
> > unsigned long end_pfn = min_t(unsigned long,
> > - PFN_DOWN(end), max_low_pfn);
> > + PFN_DOWN(end), min_low_pfn);
> 
> what is min_low_pfn ?  is it 0 for x86?

My implementation is totally wrong. :)
min_low_pfn is not proper value for this purpose.

I will fix it.
Sorry for noise.

Thanks.

> 
> Thanks
> 
> Yinghai
> 
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majord...@kvack.org.  For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: mailto:"d...@kvack.org";> em...@kvack.org 
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH 1/3] mm, nobootmem: fix wrong usage of max_low_pfn

2013-03-18 Thread Joonsoo Kim

On Tue, Mar 19, 2013 at 03:25:22PM +0900, Joonsoo Kim wrote:
> On Mon, Mar 18, 2013 at 10:47:41PM -0700, Yinghai Lu wrote:
> > On Mon, Mar 18, 2013 at 10:15 PM, Joonsoo Kim  
> > wrote:
> > > max_low_pfn reflect the number of _pages_ in the system,
> > > not the maximum PFN. You can easily find that fact in init_bootmem().
> > > So fix it.
> > 
> > I'm confused. for x86, we have max_low_pfn defined in ...
> 
> Below is queote from Russell King in 'https://lkml.org/lkml/2013/3/13/123'
> 
> 
>  Now, max_low_pfn is initialized this way:
>  
>  /**
>   * init_bootmem - register boot memory
>   * @start: pfn where the bitmap is to be placed
>   * @pages: number of available physical pages
>   *
>   * Returns the number of bytes needed to hold the bitmap.
>   */
>  unsigned long __init init_bootmem(unsigned long start, unsigned long pages)
>  {
> max_low_pfn = pages;
> min_low_pfn = start;
> return init_bootmem_core(NODE_DATA(0)->bdata, start, 0, pages);
>  }
>  So, min_low_pfn is the PFN offset of the start of physical memory (so
>  3GB >> PAGE_SHIFT) and max_low_pfn ends up being the number of pages,
>  _not_ the maximum PFN value
> 
> So, if physical address doesn't start at 0, max_low_pfn doesn't represent 
> the maximum PFN value. This is a case for ARM.
> 
> > 
> > #ifdef CONFIG_X86_32
> > /* max_low_pfn get updated here */
> > find_low_pfn_range();
> > #else
> > num_physpages = max_pfn;
> > 
> > check_x2apic();
> > 
> > /* How many end-of-memory variables you have, grandma! */
> > /* need this before calling reserve_initrd */
> > if (max_pfn > (1UL<<(32 - PAGE_SHIFT)))
> > max_low_pfn = e820_end_of_low_ram_pfn();
> > else
> > max_low_pfn = max_pfn;
> > 
> > and under max_low_pfn is bootmem.
> > 
> > >
> > > Additionally, if 'start_pfn == end_pfn', we don't need to go futher,
> > > so change range check.
> > >
> > > Signed-off-by: Joonsoo Kim 
> > >
> > > diff --git a/mm/nobootmem.c b/mm/nobootmem.c
> > > index 5e07d36..4711e91 100644
> > > --- a/mm/nobootmem.c
> > > +++ b/mm/nobootmem.c
> > > @@ -110,9 +110,9 @@ static unsigned long __init 
> > > __free_memory_core(phys_addr_t start,
> > >  {
> > > unsigned long start_pfn = PFN_UP(start);
> > > unsigned long end_pfn = min_t(unsigned long,
> > > - PFN_DOWN(end), max_low_pfn);
> > > + PFN_DOWN(end), min_low_pfn);
> > 
> > what is min_low_pfn ?  is it 0 for x86?
> 
> My implementation is totally wrong. :)
> min_low_pfn is not proper value for this purpose.
> 
> I will fix it.
> Sorry for noise.
> 
> Thanks.

How about using "memblock.current_limit"?

unsigned long end_pfn = min_t(unsigned long, PFN_DOWN(end),
memblock.current_limit);

Thanks.

> 
> > 
> > Thanks
> > 
> > Yinghai
> > 
> > --
> > To unsubscribe, send a message with 'unsubscribe linux-mm' in
> > the body to majord...@kvack.org.  For more info on Linux MM,
> > see: http://www.linux-mm.org/ .
> > Don't email: mailto:"d...@kvack.org";> em...@kvack.org 
> 
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majord...@kvack.org.  For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: mailto:"d...@kvack.org";> em...@kvack.org 
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH 1/3] mm, nobootmem: fix wrong usage of max_low_pfn

2013-03-19 Thread Joonsoo Kim

On Tue, Mar 19, 2013 at 12:35:45AM -0700, Yinghai Lu wrote:
> Can you check why sparc do not need to change interface during converting
> to use memblock to replace bootmem?

Sure.
According to my understanding to sparc32 code(arch/sparc/mm/init_32.c),
they already use max_low_pfn as the maximum PFN value,
not as the number of pages.

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH 2/8] sched: explicitly cpu_idle_type checking in rebalance_domains()

2013-03-19 Thread Joonsoo Kim

Hello, Peter.

On Tue, Mar 19, 2013 at 03:02:21PM +0100, Peter Zijlstra wrote:
> On Thu, 2013-02-14 at 14:48 +0900, Joonsoo Kim wrote:
> > After commit 88b8dac0, dst-cpu can be changed in load_balance(),
> > then we can't know cpu_idle_type of dst-cpu when load_balance()
> > return positive. So, add explicit cpu_idle_type checking.
> 
> No real objection I suppose, but did you actually see this go wrong?

No, I found it while I review whole scheduler code.
Thanks.

> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH 3/8] sched: don't consider other cpus in our group in case of NEWLY_IDLE

2013-03-19 Thread Joonsoo Kim

On Tue, Mar 19, 2013 at 03:20:57PM +0100, Peter Zijlstra wrote:
> On Thu, 2013-02-14 at 14:48 +0900, Joonsoo Kim wrote:
> > Commit 88b8dac0 makes load_balance() consider other cpus in its group,
> > regardless of idle type. When we do NEWLY_IDLE balancing, we should not
> > consider it, because a motivation of NEWLY_IDLE balancing is to turn
> > this cpu to non idle state if needed. This is not the case of other cpus.
> > So, change code not to consider other cpus for NEWLY_IDLE balancing.
> > 
> > With this patch, assign 'if (pulled_task) this_rq->idle_stamp = 0'
> > in idle_balance() is corrected, because NEWLY_IDLE balancing doesn't
> > consider other cpus. Assigning to 'this_rq->idle_stamp' is now valid.
> > 
> > Cc: Srivatsa Vaddagiri 
> > Signed-off-by: Joonsoo Kim 
> 
> Fair enough, good catch.
> 
> Acked-by: Peter Zijlstra 
> 
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 0c6aaf6..97498f4 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -5016,8 +5016,15 @@ static int load_balance(int this_cpu, struct rq 
> > *this_rq,
> > .cpus   = cpus,
> > };
> >  
> > +   /* For NEWLY_IDLE load_balancing, we don't need to consider
> > +* other cpus in our group */
> > +   if (idle == CPU_NEWLY_IDLE) {
> > +   env.dst_grpmask = NULL;
> > +   max_lb_iterations = 0;
> 
> Just a small nit; I don't think we'll ever get to evaluate
> max_lb_iterations when !dst_grpmask. So strictly speaking its
> superfluous to touch it.

Okay. In next spin, I will remove it and add a comment here.

Thanks.

> 
> > +   } else {
> > +   max_lb_iterations = cpumask_weight(env.dst_grpmask);
> > +   }
> > cpumask_copy(cpus, cpu_active_mask);
> > -   max_lb_iterations = cpumask_weight(env.dst_grpmask);
> >  
> > schedstat_inc(sd, lb_count[idle]);
> >  
> 
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH 4/8] sched: clean up move_task() and move_one_task()

2013-03-20 Thread Joonsoo Kim

On Tue, Mar 19, 2013 at 03:30:15PM +0100, Peter Zijlstra wrote:
> On Thu, 2013-02-14 at 14:48 +0900, Joonsoo Kim wrote:
> > Some validation for task moving is performed in move_tasks() and
> > move_one_task(). We can move these code to can_migrate_task()
> > which is already exist for this purpose.
> 
> > @@ -4011,18 +4027,7 @@ static int move_tasks(struct lb_env *env)
> > break;
> > }
> >  
> > -   if (throttled_lb_pair(task_group(p), env->src_cpu, 
> > env->dst_cpu))
> > -   goto next;
> > -
> > -   load = task_h_load(p);
> > -
> > -   if (sched_feat(LB_MIN) && load < 16 && 
> > !env->sd->nr_balance_failed)
> > -   goto next;
> > -
> > -   if ((load / 2) > env->imbalance)
> > -   goto next;
> > -
> > -   if (!can_migrate_task(p, env))
> > +   if (!can_migrate_task(p, env, false, &load))
> > goto next;
> >  
> > move_task(p, env);
> 
> Right, so I'm not so taken with this one. The whole load stuff really
> is a balance heuristic that's part of move_tasks(), move_one_task()
> really doesn't care about that.
> 
> So why did you include it? Purely so you didn't have to re-order the
> tests? I don't see any reason not to flip a tests around.

I think that I'm not fully understand what you are concerning, because of
my poor English. If possible, please elaborate on a problem in more detail.

First of all, I do my best to answer your question.

Patch 4/8, 5/8 are for mitigating useless redoing overhead caused
by LBF_ALL_PINNED. For this purpose, we should check 'cpu affinity'
before evaluating a load. Just moving up can_migrate_task() above
load evaluation code may raise side effect, because can_migrate_task() have
other checking which is 'cache hottness'. I don't want a side effect. So
embedding load evaluation to can_migrate_task() and re-order checking and
makes load evaluation disabled for move_one_task().

If your recommandation is to move up can_mirate_task() above
load evaluation code, yes, I can, and will do that. :)

Please let me know what I am misunderstand.

Thanks.

> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH 6/8] sched: rename load_balance_tmpmask to load_balance_cpu_active

2013-03-20 Thread Joonsoo Kim

On Tue, Mar 19, 2013 at 04:01:01PM +0100, Peter Zijlstra wrote:
> On Thu, 2013-02-14 at 14:48 +0900, Joonsoo Kim wrote:
> > This name doesn't represent specific meaning.
> > So rename it to imply it's purpose.
> > 
> > Signed-off-by: Joonsoo Kim 
> > 
> > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> > index 26058d0..e6f8783 100644
> > --- a/kernel/sched/core.c
> > +++ b/kernel/sched/core.c
> > @@ -6814,7 +6814,7 @@ struct task_group root_task_group;
> >  LIST_HEAD(task_groups);
> >  #endif
> >  
> > -DECLARE_PER_CPU(cpumask_var_t, load_balance_tmpmask);
> > +DECLARE_PER_CPU(cpumask_var_t, load_balance_cpu_active);
> 
> That's not much better; how about we call it: load_balance_mask.

Okay, in next spin, I will call it as load_balance_mask.

Thanks.

> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH 7/8] sched: prevent to re-select dst-cpu in load_balance()

2013-03-20 Thread Joonsoo Kim

On Tue, Mar 19, 2013 at 04:05:46PM +0100, Peter Zijlstra wrote:
> On Thu, 2013-02-14 at 14:48 +0900, Joonsoo Kim wrote:
> > Commit 88b8dac0 makes load_balance() consider other cpus in its group.
> > But, in that, there is no code for preventing to re-select dst-cpu.
> > So, same dst-cpu can be selected over and over.
> > 
> > This patch add functionality to load_balance() in order to exclude
> > cpu which is selected once.
> 
> Oh man.. seriously? Did you see this happen? Also, can't we simply
> remove it from lb->cpus?

I didn't see it, I do just logical thinking. :)
lb->cpus is for source cpus and dst-cpu is for dest cpus. So, it doesn't
works to remove it from lb->cpus.

Thanks.

> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH 8/8] sched: reset lb_env when redo in load_balance()

2013-03-20 Thread Joonsoo Kim

On Tue, Mar 19, 2013 at 04:21:23PM +0100, Peter Zijlstra wrote:
> On Thu, 2013-02-14 at 14:48 +0900, Joonsoo Kim wrote:
> > Commit 88b8dac0 makes load_balance() consider other cpus in its group.
> > So, now, When we redo in load_balance(), we should reset some fields of
> > lb_env to ensure that load_balance() works for initial cpu, not for other
> > cpus in its group. So correct it.
> > 
> > Cc: Srivatsa Vaddagiri 
> > Signed-off-by: Joonsoo Kim 
> > 
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 70631e8..25c798c 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -5014,14 +5014,20 @@ static int load_balance(int this_cpu, struct rq 
> > *this_rq,
> >  
> > struct lb_env env = {
> > .sd = sd,
> > -   .dst_cpu= this_cpu,
> > -   .dst_rq = this_rq,
> > .dst_grpmask= dst_grp,
> > .idle   = idle,
> > -   .loop_break = sched_nr_migrate_break,
> > .cpus   = cpus,
> > };
> >  
> > +   schedstat_inc(sd, lb_count[idle]);
> > +   cpumask_copy(cpus, cpu_active_mask);
> > +
> > +redo:
> > +   env.dst_cpu = this_cpu;
> > +   env.dst_rq = this_rq;
> > +   env.loop = 0;
> > +   env.loop_break = sched_nr_migrate_break;
> > +
> > /* For NEWLY_IDLE load_balancing, we don't need to consider
> >  * other cpus in our group */
> > if (idle == CPU_NEWLY_IDLE) {
> 
> OK, so this is the case where we tried to balance !this_cpu and found
> ALL_PINNED, right?
> 
> You can only get here in very weird cases where people love their
> sched_setaffinity() way too much, do we care? Why not give up?

Now that you mentioned it, I have no enough reason for this patch.
I think that giving up is more preferable to me.
I will omit this patch for next spin.

> 
> Also, looking at this, shouldn't we consider env->cpus in
> can_migrate_task() where we compute new_dst_cpu?

As previously stated, env-cpus is for src cpus, so when we decide dst_cpu,
it doesn't matter.

Really thanks for detailed review to all this patchset.
Thanks.

> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH 4/8] sched: clean up move_task() and move_one_task()

2013-03-20 Thread JoonSoo Kim

2013/3/20 Peter Zijlstra :
> On Wed, 2013-03-20 at 16:33 +0900, Joonsoo Kim wrote:
>
>> > Right, so I'm not so taken with this one. The whole load stuff really
>> > is a balance heuristic that's part of move_tasks(), move_one_task()
>> > really doesn't care about that.
>> >
>> > So why did you include it? Purely so you didn't have to re-order the
>> > tests? I don't see any reason not to flip a tests around.
>>
>> I think that I'm not fully understand what you are concerning, because of
>> my poor English. If possible, please elaborate on a problem in more detail.
>
> OK, so your initial Changelog said it wanted to remove some code
> duplication between move_tasks() and move_one_task(); but then you put
> in the load heuristics and add a boolean argument to only enable those
> for move_tasks() -- so clearly that wasn't duplicated.
>
> So why move that code.. I proposed that this was due a reluctance to
> re-arrange the various tests that stop the migration from happening.
>
> Now you say:
>
>> ... Just moving up can_migrate_task() above
>> load evaluation code may raise side effect, because can_migrate_task() have
>> other checking which is 'cache hottness'. I don't want a side effect. So
>> embedding load evaluation to can_migrate_task() and re-order checking and
>> makes load evaluation disabled for move_one_task().
>
> Which pretty much affirms this. However I also said that I don't think
> the order really matters that much; each test will cancel the migration
> of this task; the order of these tests seem immaterial.
>
>> If your recommandation is to move up can_mirate_task() above
>> load evaluation code, yes, I can, and will do that. :)
>
> I would actually propose moving the throttled test into
> can_migrate_task() and leave it at that.

Okay. I will do that in next spin.

Thanks!!

>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH 7/8] sched: prevent to re-select dst-cpu in load_balance()

2013-03-20 Thread JoonSoo Kim

2013/3/20 Peter Zijlstra :
> On Wed, 2013-03-20 at 16:43 +0900, Joonsoo Kim wrote:
>> On Tue, Mar 19, 2013 at 04:05:46PM +0100, Peter Zijlstra wrote:
>> > On Thu, 2013-02-14 at 14:48 +0900, Joonsoo Kim wrote:
>> > > Commit 88b8dac0 makes load_balance() consider other cpus in its group.
>> > > But, in that, there is no code for preventing to re-select dst-cpu.
>> > > So, same dst-cpu can be selected over and over.
>> > >
>> > > This patch add functionality to load_balance() in order to exclude
>> > > cpu which is selected once.
>> >
>> > Oh man.. seriously? Did you see this happen? Also, can't we simply
>> > remove it from lb->cpus?
>>
>> I didn't see it, I do just logical thinking. :)
>> lb->cpus is for source cpus and dst-cpu is for dest cpus. So, it doesn't
>> works to remove it from lb->cpus.
>
> How about we interpret ->cpus as the total mask to balance; so both
> source and destination. That way clearing a cpu means we won't take nor
> put tasks on it.

In my quick thought, it may be possible.
I will try to do it.

Thanks.

> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCHSET wq/for-3.10] workqueue: break up workqueue_lock into multiple locks

2013-03-20 Thread JoonSoo Kim

2013/3/19 Tejun Heo :
> On Wed, Mar 13, 2013 at 07:57:18PM -0700, Tejun Heo wrote:
>> and available in the following git branch.
>>
>>  git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq.git review-finer-locking
>
> Applied to wq/for-3.10.

Hello, Tejun.

I know I am late, but, please give me a change to ask a question.

Finer locking for workqueue code is really needed?
Is there a performance issue?
I think that there is too many locks and locking rules,
although the description about these are very nice.

Thanks.

> Thanks.
>
> --
> tejun
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH 01/10] workqueue: add wq_numa_tbl_len and wq_numa_possible_cpumask[]

2013-03-20 Thread JoonSoo Kim

2013/3/20 Tejun Heo :
> Unbound workqueues are going to be NUMA-affine.  Add wq_numa_tbl_len
> and wq_numa_possible_cpumask[] in preparation.  The former is the
> highest NUMA node ID + 1 and the latter is masks of possibles CPUs for
> each NUMA node.

It is better to move this code to topology.c or cpumask.c,
then it can be generally used.

Thanks.

> This patch only introduces these.  Future patches will make use of
> them.
>
> Signed-off-by: Tejun Heo 
> ---
>  kernel/workqueue.c | 35 ++-
>  1 file changed, 34 insertions(+), 1 deletion(-)
>
> diff --git a/kernel/workqueue.c b/kernel/workqueue.c
> index 775c2f4..9b096e3 100644
> --- a/kernel/workqueue.c
> +++ b/kernel/workqueue.c
> @@ -44,6 +44,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>
>  #include "workqueue_internal.h"
>
> @@ -256,6 +257,11 @@ struct workqueue_struct {
>
>  static struct kmem_cache *pwq_cache;
>
> +static int wq_numa_tbl_len;/* highest possible NUMA node id + 1 
> */
> +static cpumask_var_t *wq_numa_possible_cpumask;
> +   /* possible CPUs of each node, may
> +  be NULL if init failed */
> +
>  static DEFINE_MUTEX(wq_mutex); /* protects workqueues and pools */
>  static DEFINE_SPINLOCK(pwq_lock);  /* protects pool_workqueues */
>  static DEFINE_SPINLOCK(wq_mayday_lock);/* protects wq->maydays list 
> */
> @@ -4416,7 +4422,7 @@ out_unlock:
>  static int __init init_workqueues(void)
>  {
> int std_nice[NR_STD_WORKER_POOLS] = { 0, HIGHPRI_NICE_LEVEL };
> -   int i, cpu;
> +   int i, node, cpu;
>
> /* make sure we have enough bits for OFFQ pool ID */
> BUILD_BUG_ON((1LU << (BITS_PER_LONG - WORK_OFFQ_POOL_SHIFT)) <
> @@ -4429,6 +4435,33 @@ static int __init init_workqueues(void)
> cpu_notifier(workqueue_cpu_up_callback, CPU_PRI_WORKQUEUE_UP);
> hotcpu_notifier(workqueue_cpu_down_callback, CPU_PRI_WORKQUEUE_DOWN);
>
> +   /* determine NUMA pwq table len - highest node id + 1 */
> +   for_each_node(node)
> +   wq_numa_tbl_len = max(wq_numa_tbl_len, node + 1);
> +
> +   /*
> +* We want masks of possible CPUs of each node which isn't readily
> +* available.  Build one from cpu_to_node() which should have been
> +* fully initialized by now.
> +*/
> +   wq_numa_possible_cpumask = kzalloc(wq_numa_tbl_len *
> +  
> sizeof(wq_numa_possible_cpumask[0]),
> +  GFP_KERNEL);
> +   BUG_ON(!wq_numa_possible_cpumask);
> +
> +   for_each_node(node)
> +   
> BUG_ON(!alloc_cpumask_var_node(&wq_numa_possible_cpumask[node],
> +  GFP_KERNEL, node));
> +   for_each_possible_cpu(cpu) {
> +   node = cpu_to_node(cpu);
> +   if (WARN_ON(node == NUMA_NO_NODE)) {
> +   pr_err("workqueue: NUMA node mapping not available 
> for cpu%d, disabling NUMA support\n", cpu);
> +   wq_numa_possible_cpumask = NULL;
> +   break;
> +   }
> +   cpumask_set_cpu(cpu, wq_numa_possible_cpumask[node]);
> +   }
> +
> /* initialize CPU pools */
> for_each_possible_cpu(cpu) {
> struct worker_pool *pool;
> --
> 1.8.1.4
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

[PATCH v5 1/3] ARM: vmregion: remove vmregion code entirely

2013-02-04 Thread Joonsoo Kim

Now, there is no user for vmregion.
So remove it.

Acked-by: Nicolas Pitre 
Signed-off-by: Joonsoo Kim 

diff --git a/arch/arm/mm/Makefile b/arch/arm/mm/Makefile
index 8a9c4cb..4e333fa 100644
--- a/arch/arm/mm/Makefile
+++ b/arch/arm/mm/Makefile
@@ -6,7 +6,7 @@ obj-y   := dma-mapping.o extable.o 
fault.o init.o \
   iomap.o
 
 obj-$(CONFIG_MMU)  += fault-armv.o flush.o idmap.o ioremap.o \
-  mmap.o pgd.o mmu.o vmregion.o
+  mmap.o pgd.o mmu.o
 
 ifneq ($(CONFIG_MMU),y)
 obj-y  += nommu.o
diff --git a/arch/arm/mm/vmregion.c b/arch/arm/mm/vmregion.c
deleted file mode 100644
index a631016..000
--- a/arch/arm/mm/vmregion.c
+++ /dev/null
@@ -1,205 +0,0 @@
-#include 
-#include 
-#include 
-#include 
-#include 
-#include 
-
-#include "vmregion.h"
-
-/*
- * VM region handling support.
- *
- * This should become something generic, handling VM region allocations for
- * vmalloc and similar (ioremap, module space, etc).
- *
- * I envisage vmalloc()'s supporting vm_struct becoming:
- *
- *  struct vm_struct {
- *struct vmregion  region;
- *unsigned longflags;
- *struct page  **pages;
- *unsigned int nr_pages;
- *unsigned longphys_addr;
- *  };
- *
- * get_vm_area() would then call vmregion_alloc with an appropriate
- * struct vmregion head (eg):
- *
- *  struct vmregion vmalloc_head = {
- * .vm_list= LIST_HEAD_INIT(vmalloc_head.vm_list),
- * .vm_start   = VMALLOC_START,
- * .vm_end = VMALLOC_END,
- *  };
- *
- * However, vmalloc_head.vm_start is variable (typically, it is dependent on
- * the amount of RAM found at boot time.)  I would imagine that get_vm_area()
- * would have to initialise this each time prior to calling vmregion_alloc().
- */
-
-struct arm_vmregion *
-arm_vmregion_alloc(struct arm_vmregion_head *head, size_t align,
-  size_t size, gfp_t gfp, const void *caller)
-{
-   unsigned long start = head->vm_start, addr = head->vm_end;
-   unsigned long flags;
-   struct arm_vmregion *c, *new;
-
-   if (head->vm_end - head->vm_start < size) {
-   printk(KERN_WARNING "%s: allocation too big (requested %#x)\n",
-   __func__, size);
-   goto out;
-   }
-
-   new = kmalloc(sizeof(struct arm_vmregion), gfp);
-   if (!new)
-   goto out;
-
-   new->caller = caller;
-
-   spin_lock_irqsave(&head->vm_lock, flags);
-
-   addr = rounddown(addr - size, align);
-   list_for_each_entry_reverse(c, &head->vm_list, vm_list) {
-   if (addr >= c->vm_end)
-   goto found;
-   addr = rounddown(c->vm_start - size, align);
-   if (addr < start)
-   goto nospc;
-   }
-
- found:
-   /*
-* Insert this entry after the one we found.
-*/
-   list_add(&new->vm_list, &c->vm_list);
-   new->vm_start = addr;
-   new->vm_end = addr + size;
-   new->vm_active = 1;
-
-   spin_unlock_irqrestore(&head->vm_lock, flags);
-   return new;
-
- nospc:
-   spin_unlock_irqrestore(&head->vm_lock, flags);
-   kfree(new);
- out:
-   return NULL;
-}
-
-static struct arm_vmregion *__arm_vmregion_find(struct arm_vmregion_head 
*head, unsigned long addr)
-{
-   struct arm_vmregion *c;
-
-   list_for_each_entry(c, &head->vm_list, vm_list) {
-   if (c->vm_active && c->vm_start == addr)
-   goto out;
-   }
-   c = NULL;
- out:
-   return c;
-}
-
-struct arm_vmregion *arm_vmregion_find(struct arm_vmregion_head *head, 
unsigned long addr)
-{
-   struct arm_vmregion *c;
-   unsigned long flags;
-
-   spin_lock_irqsave(&head->vm_lock, flags);
-   c = __arm_vmregion_find(head, addr);
-   spin_unlock_irqrestore(&head->vm_lock, flags);
-   return c;
-}
-
-struct arm_vmregion *arm_vmregion_find_remove(struct arm_vmregion_head *head, 
unsigned long addr)
-{
-   struct arm_vmregion *c;
-   unsigned long flags;
-
-   spin_lock_irqsave(&head->vm_lock, flags);
-   c = __arm_vmregion_find(head, addr);
-   if (c)
-   c->vm_active = 0;
-   spin_unlock_irqrestore(&head->vm_lock, flags);
-   return c;
-}
-
-void arm_vmregion_free(struct arm_vmregion_head *head, struct arm_vmregion *c)
-{
-   unsigned long flags;
-
-   spin_lock_irqsave(&head->vm_lock, flags);
-   list_del(&c->vm_list);
-   spin_unlock_irqrestore(&head->vm_lock, flags);
-
-   kfree(c);
-}
-
-#ifdef CONFIG_PROC_FS
-static int arm_vmregion_show(struct seq_file *m, void *p)
-{
-   struct arm_vmregion *c = list_entry(p, struct arm_vmre

[PATCH v5 2/3] ARM: ioremap: introduce an infrastructure for static mapped area

2013-02-04 Thread Joonsoo Kim

In current implementation, we used ARM-specific flag, that is,
VM_ARM_STATIC_MAPPING, for distinguishing ARM specific static mapped area.
The purpose of static mapped area is to re-use static mapped area when
entire physical address range of the ioremap request can be covered
by this area.

This implementation causes needless overhead for some cases.
For example, assume that there is only one static mapped area and
vmlist has 300 areas. Every time we call ioremap, we check 300 areas for
deciding whether it is matched or not. Moreover, even if there is
no static mapped area and vmlist has 300 areas, every time we call
ioremap, we check 300 areas in now.

If we construct a extra list for static mapped area, we can eliminate
above mentioned overhead.
With a extra list, if there is one static mapped area,
we just check only one area and proceed next operation quickly.

In fact, it is not a critical problem, because ioremap is not frequently
used. But reducing overhead is better idea.

Another reason for doing this work is for removing architecture dependency
on vmalloc layer. I think that vmlist and vmlist_lock is internal data
structure for vmalloc layer. Some codes for debugging and stat inevitably
use vmlist and vmlist_lock. But it is preferable that they are used
as least as possible in outside of vmalloc.c

Now, I introduce an ARM-specific infrastructure for static mapped area. In
the following patch, we will use this and resolve above mentioned problem.

Signed-off-by: Joonsoo Kim 

diff --git a/arch/arm/mm/ioremap.c b/arch/arm/mm/ioremap.c
index 88fd86c..904c15e 100644
--- a/arch/arm/mm/ioremap.c
+++ b/arch/arm/mm/ioremap.c
@@ -39,6 +39,70 @@
 #include 
 #include "mm.h"
 
+
+LIST_HEAD(static_vmlist);
+
+static struct static_vm *find_static_vm_paddr(phys_addr_t paddr,
+   size_t size, unsigned int mtype)
+{
+   struct static_vm *svm;
+   struct vm_struct *vm;
+
+   list_for_each_entry(svm, &static_vmlist, list) {
+   vm = &svm->vm;
+   if (!(vm->flags & VM_ARM_STATIC_MAPPING))
+   continue;
+   if ((vm->flags & VM_ARM_MTYPE_MASK) != VM_ARM_MTYPE(mtype))
+   continue;
+
+   if (vm->phys_addr > paddr ||
+   paddr + size - 1 > vm->phys_addr + vm->size - 1)
+   continue;
+
+   return svm;
+   }
+
+   return NULL;
+}
+
+struct static_vm *find_static_vm_vaddr(void *vaddr)
+{
+   struct static_vm *svm;
+   struct vm_struct *vm;
+
+   list_for_each_entry(svm, &static_vmlist, list) {
+   vm = &svm->vm;
+
+   /* static_vmlist is ascending order */
+   if (vm->addr > vaddr)
+   break;
+
+   if (vm->addr <= vaddr && vm->addr + vm->size > vaddr)
+   return svm;
+   }
+
+   return NULL;
+}
+
+void __init add_static_vm_early(struct static_vm *svm)
+{
+   struct static_vm *curr_svm;
+   struct vm_struct *vm;
+   void *vaddr;
+
+   vm = &svm->vm;
+   vm_area_add_early(vm);
+   vaddr = vm->addr;
+
+   list_for_each_entry(curr_svm, &static_vmlist, list) {
+   vm = &curr_svm->vm;
+
+   if (vm->addr > vaddr)
+   break;
+   }
+   list_add_tail(&svm->list, &curr_svm->list);
+}
+
 int ioremap_page(unsigned long virt, unsigned long phys,
 const struct mem_type *mtype)
 {
diff --git a/arch/arm/mm/mm.h b/arch/arm/mm/mm.h
index a8ee92d..d5a4e9a 100644
--- a/arch/arm/mm/mm.h
+++ b/arch/arm/mm/mm.h
@@ -1,4 +1,6 @@
 #ifdef CONFIG_MMU
+#include 
+#include 
 
 /* the upper-most page table pointer */
 extern pmd_t *top_pmd;
@@ -65,6 +67,16 @@ extern void __flush_dcache_page(struct address_space 
*mapping, struct page *page
 /* consistent regions used by dma_alloc_attrs() */
 #define VM_ARM_DMA_CONSISTENT  0x2000
 
+
+struct static_vm {
+   struct vm_struct vm;
+   struct list_head list;
+};
+
+extern struct list_head static_vmlist;
+extern struct static_vm *find_static_vm_vaddr(void *vaddr);
+extern __init void add_static_vm_early(struct static_vm *svm);
+
 #endif
 
 #ifdef CONFIG_ZONE_DMA
-- 
1.7.9.5

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

[PATCH v5 0/3] introduce static_vm for ARM-specific static mapped area

2013-02-04 Thread Joonsoo Kim

In current implementation, we used ARM-specific flag, that is,
VM_ARM_STATIC_MAPPING, for distinguishing ARM specific static mapped area.
The purpose of static mapped area is to re-use static mapped area when
entire physical address range of the ioremap request can be covered
by this area.

This implementation causes needless overhead for some cases.
For example, assume that there is only one static mapped area and
vmlist has 300 areas. Every time we call ioremap, we check 300 areas for
deciding whether it is matched or not. Moreover, even if there is
no static mapped area and vmlist has 300 areas, every time we call
ioremap, we check 300 areas in now.

If we construct a extra list for static mapped area, we can eliminate
above mentioned overhead.
With a extra list, if there is one static mapped area,
we just check only one area and proceed next operation quickly.

In fact, it is not a critical problem, because ioremap is not frequently
used. But reducing overhead is better idea.

Another reason for doing this work is for removing vm_struct list management,
entirely. For more information, look at the following link.
http://lkml.org/lkml/2012/12/6/184

Changelog
v4->v5:
  [2/3]: Changes from Nicolas' suggestion
 - don't use separate flags for static_vm
 - remove a lock
 - declare add_static_vm_early() as __init
  [3/3]: Changes from Nicolas' suggestion
 - add / leave comments

v3->v4:
  [2/3]: Changes from Nicolas' suggestion
 - embed static_vm code in ioremap.c
 - simplify struct static_vm
 - remove init_static_vm, instead, add_static_vm_early() init static_vm
 Use generic list for list management of static_vm
 Convert spin_lock to rw_lock
 Modify static_vm's flags bits
  [3/3]: Rework according to [2/3] change
  Rebased on v3.8-rc5

v2->v3:
  coverletter: refer a link related to this work
  [2/3]: drop @flags of find_static_vm_vaddr
  Rebased on v3.8-rc4

v1->v2: 
  [2/3]: patch description is improved.
  Rebased on v3.7-rc7

Joonsoo Kim (3):
  ARM: vmregion: remove vmregion code entirely
  ARM: ioremap: introduce an infrastructure for static mapped area
  ARM: mm: use static_vm for managing static mapped areas

 arch/arm/mm/Makefile   |2 +-
 arch/arm/mm/ioremap.c  |  135 +--
 arch/arm/mm/mm.h   |   12 +++
 arch/arm/mm/mmu.c  |   34 
 arch/arm/mm/vmregion.c |  205 
 arch/arm/mm/vmregion.h |   31 
 6 files changed, 123 insertions(+), 296 deletions(-)
 delete mode 100644 arch/arm/mm/vmregion.c
 delete mode 100644 arch/arm/mm/vmregion.h

-- 
1.7.9.5

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

[PATCH v5 3/3] ARM: mm: use static_vm for managing static mapped areas

2013-02-04 Thread Joonsoo Kim

A static mapped area is ARM-specific, so it is better not to use
generic vmalloc data structure, that is, vmlist and vmlist_lock
for managing static mapped area. And it causes some needless overhead and
reducing this overhead is better idea.

Now, we have newly introduced static_vm infrastructure.
With it, we don't need to iterate all mapped areas. Instead, we just
iterate static mapped areas. It helps to reduce an overhead of finding
matched area. And architecture dependency on vmalloc layer is removed,
so it will help to maintainability for vmalloc layer.

Signed-off-by: Joonsoo Kim 

diff --git a/arch/arm/mm/ioremap.c b/arch/arm/mm/ioremap.c
index 904c15e..c7fef4b 100644
--- a/arch/arm/mm/ioremap.c
+++ b/arch/arm/mm/ioremap.c
@@ -261,13 +261,14 @@ void __iomem * __arm_ioremap_pfn_caller(unsigned long pfn,
const struct mem_type *type;
int err;
unsigned long addr;
-   struct vm_struct * area;
+   struct vm_struct *area;
+   phys_addr_t paddr = __pfn_to_phys(pfn);
 
 #ifndef CONFIG_ARM_LPAE
/*
 * High mappings must be supersection aligned
 */
-   if (pfn >= 0x10 && (__pfn_to_phys(pfn) & ~SUPERSECTION_MASK))
+   if (pfn >= 0x10 && (paddr & ~SUPERSECTION_MASK))
return NULL;
 #endif
 
@@ -283,24 +284,16 @@ void __iomem * __arm_ioremap_pfn_caller(unsigned long pfn,
/*
 * Try to reuse one of the static mapping whenever possible.
 */
-   read_lock(&vmlist_lock);
-   for (area = vmlist; area; area = area->next) {
-   if (!size || (sizeof(phys_addr_t) == 4 && pfn >= 0x10))
-   break;
-   if (!(area->flags & VM_ARM_STATIC_MAPPING))
-   continue;
-   if ((area->flags & VM_ARM_MTYPE_MASK) != VM_ARM_MTYPE(mtype))
-   continue;
-   if (__phys_to_pfn(area->phys_addr) > pfn ||
-   __pfn_to_phys(pfn) + size-1 > area->phys_addr + 
area->size-1)
-   continue;
-   /* we can drop the lock here as we know *area is static */
-   read_unlock(&vmlist_lock);
-   addr = (unsigned long)area->addr;
-   addr += __pfn_to_phys(pfn) - area->phys_addr;
-   return (void __iomem *) (offset + addr);
+   if (size && !((sizeof(phys_addr_t) == 4 && pfn >= 0x10))) {
+   struct static_vm *svm;
+
+   svm = find_static_vm_paddr(paddr, size, mtype);
+   if (svm) {
+   addr = (unsigned long)svm->vm.addr;
+   addr += paddr - svm->vm.phys_addr;
+   return (void __iomem *) (offset + addr);
+   }
}
-   read_unlock(&vmlist_lock);
 
/*
 * Don't allow RAM to be mapped - this causes problems with ARMv6+
@@ -312,21 +305,21 @@ void __iomem * __arm_ioremap_pfn_caller(unsigned long pfn,
if (!area)
return NULL;
addr = (unsigned long)area->addr;
-   area->phys_addr = __pfn_to_phys(pfn);
+   area->phys_addr = paddr;
 
 #if !defined(CONFIG_SMP) && !defined(CONFIG_ARM_LPAE)
if (DOMAIN_IO == 0 &&
(((cpu_architecture() >= CPU_ARCH_ARMv6) && (get_cr() & CR_XP)) ||
   cpu_is_xsc3()) && pfn >= 0x10 &&
-  !((__pfn_to_phys(pfn) | size | addr) & ~SUPERSECTION_MASK)) {
+  !((paddr | size | addr) & ~SUPERSECTION_MASK)) {
area->flags |= VM_ARM_SECTION_MAPPING;
err = remap_area_supersections(addr, pfn, size, type);
-   } else if (!((__pfn_to_phys(pfn) | size | addr) & ~PMD_MASK)) {
+   } else if (!((paddr | size | addr) & ~PMD_MASK)) {
area->flags |= VM_ARM_SECTION_MAPPING;
err = remap_area_sections(addr, pfn, size, type);
} else
 #endif
-   err = ioremap_page_range(addr, addr + size, __pfn_to_phys(pfn),
+   err = ioremap_page_range(addr, addr + size, paddr,
 __pgprot(type->prot_pte));
 
if (err) {
@@ -410,34 +403,28 @@ __arm_ioremap_exec(unsigned long phys_addr, size_t size, 
bool cached)
 void __iounmap(volatile void __iomem *io_addr)
 {
void *addr = (void *)(PAGE_MASK & (unsigned long)io_addr);
-   struct vm_struct *vm;
+   struct static_vm *svm;
+
+   /* If this is a static mapping, we must leave it alone */
+   svm = find_static_vm_vaddr(addr);
+   if (svm)
+   return;
 
-   read_lock(&vmlist_lock);
-   for (vm = vmlist; vm; vm = vm->next) {
-   if (vm->addr > addr)
-   break;
-   if (!(vm->flags & VM_IOREMAP))
-

Re: [PATCH v5 3/3] ARM: mm: use static_vm for managing static mapped areas

2013-02-05 Thread Joonsoo Kim

Hello, Nicolas.

On Mon, Feb 04, 2013 at 11:44:16PM -0500, Nicolas Pitre wrote:
> On Tue, 5 Feb 2013, Joonsoo Kim wrote:
> 
> > A static mapped area is ARM-specific, so it is better not to use
> > generic vmalloc data structure, that is, vmlist and vmlist_lock
> > for managing static mapped area. And it causes some needless overhead and
> > reducing this overhead is better idea.
> > 
> > Now, we have newly introduced static_vm infrastructure.
> > With it, we don't need to iterate all mapped areas. Instead, we just
> > iterate static mapped areas. It helps to reduce an overhead of finding
> > matched area. And architecture dependency on vmalloc layer is removed,
> > so it will help to maintainability for vmalloc layer.
> > 
> > Signed-off-by: Joonsoo Kim 
> 
> Some comments below.
> 
> > diff --git a/arch/arm/mm/ioremap.c b/arch/arm/mm/ioremap.c
> > index 904c15e..c7fef4b 100644
> > --- a/arch/arm/mm/ioremap.c
> > +++ b/arch/arm/mm/ioremap.c
> > @@ -261,13 +261,14 @@ void __iomem * __arm_ioremap_pfn_caller(unsigned long 
> > pfn,
> > const struct mem_type *type;
> > int err;
> > unsigned long addr;
> > -   struct vm_struct * area;
> > +   struct vm_struct *area;
> > +   phys_addr_t paddr = __pfn_to_phys(pfn);
> >  
> >  #ifndef CONFIG_ARM_LPAE
> > /*
> >  * High mappings must be supersection aligned
> >  */
> > -   if (pfn >= 0x10 && (__pfn_to_phys(pfn) & ~SUPERSECTION_MASK))
> > +   if (pfn >= 0x10 && (paddr & ~SUPERSECTION_MASK))
> > return NULL;
> >  #endif
> >  
> > @@ -283,24 +284,16 @@ void __iomem * __arm_ioremap_pfn_caller(unsigned long 
> > pfn,
> > /*
> >  * Try to reuse one of the static mapping whenever possible.
> >  */
> > -   read_lock(&vmlist_lock);
> > -   for (area = vmlist; area; area = area->next) {
> > -   if (!size || (sizeof(phys_addr_t) == 4 && pfn >= 0x10))
> > -   break;
> > -   if (!(area->flags & VM_ARM_STATIC_MAPPING))
> > -   continue;
> > -   if ((area->flags & VM_ARM_MTYPE_MASK) != VM_ARM_MTYPE(mtype))
> > -   continue;
> > -   if (__phys_to_pfn(area->phys_addr) > pfn ||
> > -   __pfn_to_phys(pfn) + size-1 > area->phys_addr + 
> > area->size-1)
> > -   continue;
> > -   /* we can drop the lock here as we know *area is static */
> > -   read_unlock(&vmlist_lock);
> > -   addr = (unsigned long)area->addr;
> > -   addr += __pfn_to_phys(pfn) - area->phys_addr;
> > -   return (void __iomem *) (offset + addr);
> > +   if (size && !((sizeof(phys_addr_t) == 4 && pfn >= 0x10))) {
>  ^ ^
> You have a needless extra set of parents here.
 
Okay.

> [...]
> 
> > diff --git a/arch/arm/mm/mmu.c b/arch/arm/mm/mmu.c
> > index ce328c7..b2c0356 100644
> > --- a/arch/arm/mm/mmu.c
> > +++ b/arch/arm/mm/mmu.c
> > @@ -757,21 +757,24 @@ void __init iotable_init(struct map_desc *io_desc, 
> > int nr)
> >  {
> > struct map_desc *md;
> > struct vm_struct *vm;
> > +   struct static_vm *svm;
> >  
> > if (!nr)
> > return;
> >  
> > -   vm = early_alloc_aligned(sizeof(*vm) * nr, __alignof__(*vm));
> > +   svm = early_alloc_aligned(sizeof(*svm) * nr, __alignof__(*svm));
> >  
> > for (md = io_desc; nr; md++, nr--) {
> > create_mapping(md);
> > +
> > +   vm = &svm->vm;
> > vm->addr = (void *)(md->virtual & PAGE_MASK);
> > vm->size = PAGE_ALIGN(md->length + (md->virtual & ~PAGE_MASK));
> > vm->phys_addr = __pfn_to_phys(md->pfn);
> > vm->flags = VM_IOREMAP | VM_ARM_STATIC_MAPPING;
> > vm->flags |= VM_ARM_MTYPE(md->type);
> > vm->caller = iotable_init;
> > -   vm_area_add_early(vm++);
> > +   add_static_vm_early(svm++);
> > }
> >  }
> >  
> > @@ -779,13 +782,16 @@ void __init vm_reserve_area_early(unsigned long addr, 
> > unsigned long size,
> >   void *caller)
> >  {
> > struct vm_struct *vm;
> > +   struct static_vm *svm;
> > +
> > +   svm = early_alloc_aligned(sizeof(*svm), __alignof__(*svm));
> >  
> >

Re: [PATCH v5 3/3] ARM: mm: use static_vm for managing static mapped areas

2013-02-05 Thread Joonsoo Kim

Hello, Rob.

On Tue, Feb 05, 2013 at 01:12:51PM -0600, Rob Herring wrote:
> On 02/05/2013 12:13 PM, Nicolas Pitre wrote:
> > On Tue, 5 Feb 2013, Rob Herring wrote:
> > 
> >> On 02/04/2013 10:44 PM, Nicolas Pitre wrote:
> >>> On Tue, 5 Feb 2013, Joonsoo Kim wrote:
> >>>
> >>>> A static mapped area is ARM-specific, so it is better not to use
> >>>> generic vmalloc data structure, that is, vmlist and vmlist_lock
> >>>> for managing static mapped area. And it causes some needless overhead and
> >>>> reducing this overhead is better idea.
> >>>>
> >>>> Now, we have newly introduced static_vm infrastructure.
> >>>> With it, we don't need to iterate all mapped areas. Instead, we just
> >>>> iterate static mapped areas. It helps to reduce an overhead of finding
> >>>> matched area. And architecture dependency on vmalloc layer is removed,
> >>>> so it will help to maintainability for vmalloc layer.
> >>>>
> >>>> Signed-off-by: Joonsoo Kim 
> >>
> >> [snip]
> >>
> >>>> @@ -859,17 +864,12 @@ static void __init pci_reserve_io(void)
> >>>>  {
> >>>>  struct vm_struct *vm;
> >>>>  unsigned long addr;
> >>>> +struct static_vm *svm;
> >>>>  
> >>>> -/* we're still single threaded hence no lock needed here */
> >>>> -for (vm = vmlist; vm; vm = vm->next) {
> >>>> -if (!(vm->flags & VM_ARM_STATIC_MAPPING))
> >>>> -continue;
> >>>> -addr = (unsigned long)vm->addr;
> >>>> -addr &= ~(SZ_2M - 1);
> >>>> -if (addr == PCI_IO_VIRT_BASE)
> >>>> -return;
> >>>> +svm = find_static_vm_vaddr((void *)PCI_IO_VIRT_BASE);
> >>>> +if (svm)
> >>>> +return;
> >>>>
> >>>> -}
> >>>>  
> >>>>  vm_reserve_area_early(PCI_IO_VIRT_BASE, SZ_2M, pci_reserve_io);
> >>>>  }
> >>>
> >>> The replacement code is not equivalent.  I can't recall why the original 
> >>> is as it is, but it doesn't look right to me.  The 2MB round down 
> >>> certainly looks suspicious.
> >>
> >> The PCI mapping is at a fixed, aligned 2MB mapping. If we find any
> >> virtual address within that region already mapped, it is an error.
> > Ah, OK.  This wasn't clear looking at the code.
> >> We probably should have had a WARN here.
> > 
> > Indeed.
> > 

Okay.
I should fix it to find any mapping within PCI reserved region.
But, I think that it is not an error.
Now, I see your original commit 'c2794437091a4fda72c4a4f3567dd728dcc0c3c9'
and find below message.

"Platforms which need early i/o mapping (e.g. for vga console) can call
pci_map_io_early in their .map_io function."

Therfore, for some platform, it is possible that there is a mapping within
PCI reserved range.

So, I will not add WARN here.

I will fix and re-send v6 with your ACK.

Thanks for review.

> >>>
> >>> The replacement code should be better.  However I'd like you to get an 
> >>> ACK from Rob Herring as well for this patch.
> >>
> >> It doesn't appear to me the above case is handled. The virt addr is
> >> checked whether it is within an existing mapping, but not whether the
> >> new mapping would overlap an existing mapping. It would be good to check
> >> for this generically rather than specifically for the PCI i/o mapping.
> > 
> > Agreed.  However that is checked already in vm_area_add_early().  
> > Therefore the overlap test here is redundant.
> 
> Ah, right. In that case:
> 
> Acked-by: Rob Herring 
> 
> Rob
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH v5 1/3] ARM: vmregion: remove vmregion code entirely

2013-02-05 Thread Joonsoo Kim

Hello, Santosh.

On Tue, Feb 05, 2013 at 02:22:39PM +0530, Santosh Shilimkar wrote:
> On Tuesday 05 February 2013 06:01 AM, Joonsoo Kim wrote:
> >Now, there is no user for vmregion.
> >So remove it.
> >
> >Acked-by: Nicolas Pitre 
> >Signed-off-by: Joonsoo Kim 
> >
> >diff --git a/arch/arm/mm/Makefile b/arch/arm/mm/Makefile
> >index 8a9c4cb..4e333fa 100644
> >--- a/arch/arm/mm/Makefile
> >+++ b/arch/arm/mm/Makefile
> >@@ -6,7 +6,7 @@ obj-y:= dma-mapping.o 
> >extable.o fault.o init.o \
> >iomap.o
> >
> >  obj-$(CONFIG_MMU)  += fault-armv.o flush.o idmap.o ioremap.o \
> >-   mmap.o pgd.o mmu.o vmregion.o
> >+   mmap.o pgd.o mmu.o
> >
> >  ifneq ($(CONFIG_MMU),y)
> >  obj-y  += nommu.o
> >diff --git a/arch/arm/mm/vmregion.c b/arch/arm/mm/vmregion.c
> >deleted file mode 100644
> >index a631016..000
> >--- a/arch/arm/mm/vmregion.c
> >+++ /dev/null
> >@@ -1,205 +0,0 @@
> You might want to use 'git format-patch -D'
> which will just generate one line for a deleted file.

Nice tip!
Thanks.

> Regards,
> Santosh
> 
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH v5 0/3] introduce static_vm for ARM-specific static mapped area

2013-02-05 Thread Joonsoo Kim

Hello, Santosh.

On Tue, Feb 05, 2013 at 02:32:06PM +0530, Santosh Shilimkar wrote:
> On Tuesday 05 February 2013 06:01 AM, Joonsoo Kim wrote:
> >In current implementation, we used ARM-specific flag, that is,
> >VM_ARM_STATIC_MAPPING, for distinguishing ARM specific static mapped area.
> >The purpose of static mapped area is to re-use static mapped area when
> >entire physical address range of the ioremap request can be covered
> >by this area.
> >
> >This implementation causes needless overhead for some cases.
> >For example, assume that there is only one static mapped area and
> >vmlist has 300 areas. Every time we call ioremap, we check 300 areas for
> >deciding whether it is matched or not. Moreover, even if there is
> >no static mapped area and vmlist has 300 areas, every time we call
> >ioremap, we check 300 areas in now.
> >
> >If we construct a extra list for static mapped area, we can eliminate
> >above mentioned overhead.
> >With a extra list, if there is one static mapped area,
> >we just check only one area and proceed next operation quickly.
> >
> >In fact, it is not a critical problem, because ioremap is not frequently
> >used. But reducing overhead is better idea.
> >
> >Another reason for doing this work is for removing vm_struct list management,
> >entirely. For more information, look at the following link.
> >http://lkml.org/lkml/2012/12/6/184
> >
> 
> [..]
> 
> >
> >Joonsoo Kim (3):
> >   ARM: vmregion: remove vmregion code entirely
> >   ARM: ioremap: introduce an infrastructure for static mapped area
> >   ARM: mm: use static_vm for managing static mapped areas
> >
> >  arch/arm/mm/Makefile   |2 +-
> >  arch/arm/mm/ioremap.c  |  135 +--
> >  arch/arm/mm/mm.h   |   12 +++
> >  arch/arm/mm/mmu.c  |   34 
> >  arch/arm/mm/vmregion.c |  205 
> > 
> >  arch/arm/mm/vmregion.h |   31 
> >  6 files changed, 123 insertions(+), 296 deletions(-)
> >  delete mode 100644 arch/arm/mm/vmregion.c
> >  delete mode 100644 arch/arm/mm/vmregion.h
> >
> Nice Clean-up. I tested this series on OMAP which uses few static
> mappings. Feel free to add,
> 
> Tested-by: Santosh Shilimkar

I will re-send v6 with your Tested-by.

Thanks for testing this.

> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH v5 3/3] ARM: mm: use static_vm for managing static mapped areas

2013-02-05 Thread Joonsoo Kim

On Wed, Feb 06, 2013 at 11:07:07AM +0900, Joonsoo Kim wrote:
> Hello, Rob.
> 
> On Tue, Feb 05, 2013 at 01:12:51PM -0600, Rob Herring wrote:
> > On 02/05/2013 12:13 PM, Nicolas Pitre wrote:
> > > On Tue, 5 Feb 2013, Rob Herring wrote:
> > > 
> > >> On 02/04/2013 10:44 PM, Nicolas Pitre wrote:
> > >>> On Tue, 5 Feb 2013, Joonsoo Kim wrote:
> > >>>
> > >>>> A static mapped area is ARM-specific, so it is better not to use
> > >>>> generic vmalloc data structure, that is, vmlist and vmlist_lock
> > >>>> for managing static mapped area. And it causes some needless overhead 
> > >>>> and
> > >>>> reducing this overhead is better idea.
> > >>>>
> > >>>> Now, we have newly introduced static_vm infrastructure.
> > >>>> With it, we don't need to iterate all mapped areas. Instead, we just
> > >>>> iterate static mapped areas. It helps to reduce an overhead of finding
> > >>>> matched area. And architecture dependency on vmalloc layer is removed,
> > >>>> so it will help to maintainability for vmalloc layer.
> > >>>>
> > >>>> Signed-off-by: Joonsoo Kim 
> > >>
> > >> [snip]
> > >>
> > >>>> @@ -859,17 +864,12 @@ static void __init pci_reserve_io(void)
> > >>>>  {
> > >>>>struct vm_struct *vm;
> > >>>>unsigned long addr;
> > >>>> +  struct static_vm *svm;
> > >>>>  
> > >>>> -  /* we're still single threaded hence no lock needed here */
> > >>>> -  for (vm = vmlist; vm; vm = vm->next) {
> > >>>> -  if (!(vm->flags & VM_ARM_STATIC_MAPPING))
> > >>>> -  continue;
> > >>>> -  addr = (unsigned long)vm->addr;
> > >>>> -  addr &= ~(SZ_2M - 1);
> > >>>> -  if (addr == PCI_IO_VIRT_BASE)
> > >>>> -  return;
> > >>>> +  svm = find_static_vm_vaddr((void *)PCI_IO_VIRT_BASE);
> > >>>> +  if (svm)
> > >>>> +  return;
> > >>>>
> > >>>> -  }
> > >>>>  
> > >>>>vm_reserve_area_early(PCI_IO_VIRT_BASE, SZ_2M, pci_reserve_io);
> > >>>>  }
> > >>>
> > >>> The replacement code is not equivalent.  I can't recall why the 
> > >>> original 
> > >>> is as it is, but it doesn't look right to me.  The 2MB round down 
> > >>> certainly looks suspicious.
> > >>
> > >> The PCI mapping is at a fixed, aligned 2MB mapping. If we find any
> > >> virtual address within that region already mapped, it is an error.
> > > Ah, OK.  This wasn't clear looking at the code.
> > >> We probably should have had a WARN here.
> > > 
> > > Indeed.
> > > 
> 
> Okay.
> I should fix it to find any mapping within PCI reserved region.

Ah...
Above comment is my mistake.

If there is a region already mapped within PCI reserved region and
it is not found by find_static_vm_vaddr(), vm_area_add_early() hit BUG_ON().
So, to leave find_static_vm_vaddr() is safe.

> But, I think that it is not an error.
> Now, I see your original commit 'c2794437091a4fda72c4a4f3567dd728dcc0c3c9'
> and find below message.
> 
> "Platforms which need early i/o mapping (e.g. for vga console) can call
> pci_map_io_early in their .map_io function."
> 
> Therfore, for some platform, it is possible that there is a mapping within
> PCI reserved range.
> 
> So, I will not add WARN here.
> 
> I will fix and re-send v6 with your ACK.
> 
> Thanks for review.
> 
> > >>>
> > >>> The replacement code should be better.  However I'd like you to get an 
> > >>> ACK from Rob Herring as well for this patch.
> > >>
> > >> It doesn't appear to me the above case is handled. The virt addr is
> > >> checked whether it is within an existing mapping, but not whether the
> > >> new mapping would overlap an existing mapping. It would be good to check
> > >> for this generically rather than specifically for the PCI i/o mapping.
> > > 
> > > Agreed.  However that is checked already in vm_area_add_early().  
> > > Therefore the overlap test here is redundant.
> > 
> > Ah, right. In that case:
> > 
> > Acked-by: Rob Herring 
> > 
> > Rob
> > 
> > --
> > To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> > the body of a message to majord...@vger.kernel.org
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> > Please read the FAQ at  http://www.tux.org/lkml/
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

[PATCH RESEND] ARM: sched: correct update_sched_clock()

2013-02-05 Thread Joonsoo Kim

If we want load epoch_cyc and epoch_ns atomically,
we should update epoch_cyc_copy first of all.
This notify reader that updating is in progress.

If we update epoch_cyc first like as current implementation,
there is subtle error case.
Look at the below example.


cyc = 9
ns = 900
cyc_copy = 9

== CASE 1 ==
   
   write cyc = 10
read cyc = 10
read ns = 900
   write ns = 1000
   write cyc_copy = 10
read cyc_copy = 10

output = (10, 900)

== CASE 2 ==
   
read cyc = 9
   write cyc = 10
   write ns = 1000
read ns = 1000
read cyc_copy = 9
   write cyc_copy = 10
output = (9, 1000)

If atomic read is ensured, output should be (9, 900) or (10, 1000).
But, output in example case are not.

So, change updating sequence in order to correct this problem.

Signed-off-by: Joonsoo Kim 

diff --git a/arch/arm/kernel/sched_clock.c b/arch/arm/kernel/sched_clock.c
index fc6692e..bd6f56b 100644
--- a/arch/arm/kernel/sched_clock.c
+++ b/arch/arm/kernel/sched_clock.c
@@ -93,11 +93,11 @@ static void notrace update_sched_clock(void)
 * detectable in cyc_to_fixed_sched_clock().
 */
raw_local_irq_save(flags);
-   cd.epoch_cyc = cyc;
+   cd.epoch_cyc_copy = cyc;
smp_wmb();
cd.epoch_ns = ns;
smp_wmb();
-   cd.epoch_cyc_copy = cyc;
+   cd.epoch_cyc = cyc;
raw_local_irq_restore(flags);
 }
 
-- 
1.7.9.5

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

[PATCH v6 0/3] introduce static_vm for ARM-specific static mapped area

2013-02-06 Thread Joonsoo Kim

In current implementation, we used ARM-specific flag, that is,
VM_ARM_STATIC_MAPPING, for distinguishing ARM specific static mapped area.
The purpose of static mapped area is to re-use static mapped area when
entire physical address range of the ioremap request can be covered
by this area.

This implementation causes needless overhead for some cases.
For example, assume that there is only one static mapped area and
vmlist has 300 areas. Every time we call ioremap, we check 300 areas for
deciding whether it is matched or not. Moreover, even if there is
no static mapped area and vmlist has 300 areas, every time we call
ioremap, we check 300 areas in now.

If we construct a extra list for static mapped area, we can eliminate
above mentioned overhead.
With a extra list, if there is one static mapped area,
we just check only one area and proceed next operation quickly.

In fact, it is not a critical problem, because ioremap is not frequently
used. But reducing overhead is better idea.

Another reason for doing this work is for removing vm_struct list management,
entirely. For more information, look at the following link.
http://lkml.org/lkml/2012/12/6/184

Changelog
v5->v6:
  Add Ack-by, Reviewed-by, Tested-by tags
  [3/3]: Change from Nicolas' suggestion
 - remove redundant parenthesis

v4->v5:
  [2/3]: Changes from Nicolas' suggestion
 - don't use separate flags for static_vm
 - remove a lock
 - declare add_static_vm_early() as __init
  [3/3]: Changes from Nicolas' suggestion
 - add / leave comments

v3->v4:
  [2/3]: Changes from Nicolas' suggestion
 - embed static_vm code in ioremap.c
 - simplify struct static_vm
 - remove init_static_vm, instead, add_static_vm_early() init static_vm
 Use generic list for list management of static_vm
 Convert spin_lock to rw_lock
 Modify static_vm's flags bits
  [3/3]: Rework according to [2/3] change
  Rebased on v3.8-rc5

v2->v3:
  coverletter: refer a link related to this work
  [2/3]: drop @flags of find_static_vm_vaddr
  Rebased on v3.8-rc4

v1->v2: 
  [2/3]: patch description is improved.
  Rebased on v3.7-rc7

Joonsoo Kim (3):
  ARM: vmregion: remove vmregion code entirely
  ARM: ioremap: introduce an infrastructure for static mapped area
  ARM: mm: use static_vm for managing static mapped areas

 arch/arm/mm/Makefile   |2 +-
 arch/arm/mm/ioremap.c  |  135 +--
 arch/arm/mm/mm.h   |   12 +++
 arch/arm/mm/mmu.c  |   34 
 arch/arm/mm/vmregion.c |  205 
 arch/arm/mm/vmregion.h |   31 
 6 files changed, 123 insertions(+), 296 deletions(-)
 delete mode 100644 arch/arm/mm/vmregion.c
 delete mode 100644 arch/arm/mm/vmregion.h

-- 
1.7.9.5

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

[PATCH v6 2/3] ARM: ioremap: introduce an infrastructure for static mapped area

2013-02-06 Thread Joonsoo Kim

In current implementation, we used ARM-specific flag, that is,
VM_ARM_STATIC_MAPPING, for distinguishing ARM specific static mapped area.
The purpose of static mapped area is to re-use static mapped area when
entire physical address range of the ioremap request can be covered
by this area.

This implementation causes needless overhead for some cases.
For example, assume that there is only one static mapped area and
vmlist has 300 areas. Every time we call ioremap, we check 300 areas for
deciding whether it is matched or not. Moreover, even if there is
no static mapped area and vmlist has 300 areas, every time we call
ioremap, we check 300 areas in now.

If we construct a extra list for static mapped area, we can eliminate
above mentioned overhead.
With a extra list, if there is one static mapped area,
we just check only one area and proceed next operation quickly.

In fact, it is not a critical problem, because ioremap is not frequently
used. But reducing overhead is better idea.

Another reason for doing this work is for removing architecture dependency
on vmalloc layer. I think that vmlist and vmlist_lock is internal data
structure for vmalloc layer. Some codes for debugging and stat inevitably
use vmlist and vmlist_lock. But it is preferable that they are used
as least as possible in outside of vmalloc.c

Now, I introduce an ARM-specific infrastructure for static mapped area. In
the following patch, we will use this and resolve above mentioned problem.

Reviewed-by: Nicolas Pitre 
Tested-by: Santosh Shilimkar 
Signed-off-by: Joonsoo Kim 

diff --git a/arch/arm/mm/ioremap.c b/arch/arm/mm/ioremap.c
index 88fd86c..904c15e 100644
--- a/arch/arm/mm/ioremap.c
+++ b/arch/arm/mm/ioremap.c
@@ -39,6 +39,70 @@
 #include 
 #include "mm.h"
 
+
+LIST_HEAD(static_vmlist);
+
+static struct static_vm *find_static_vm_paddr(phys_addr_t paddr,
+   size_t size, unsigned int mtype)
+{
+   struct static_vm *svm;
+   struct vm_struct *vm;
+
+   list_for_each_entry(svm, &static_vmlist, list) {
+   vm = &svm->vm;
+   if (!(vm->flags & VM_ARM_STATIC_MAPPING))
+   continue;
+   if ((vm->flags & VM_ARM_MTYPE_MASK) != VM_ARM_MTYPE(mtype))
+   continue;
+
+   if (vm->phys_addr > paddr ||
+   paddr + size - 1 > vm->phys_addr + vm->size - 1)
+   continue;
+
+   return svm;
+   }
+
+   return NULL;
+}
+
+struct static_vm *find_static_vm_vaddr(void *vaddr)
+{
+   struct static_vm *svm;
+   struct vm_struct *vm;
+
+   list_for_each_entry(svm, &static_vmlist, list) {
+   vm = &svm->vm;
+
+   /* static_vmlist is ascending order */
+   if (vm->addr > vaddr)
+   break;
+
+   if (vm->addr <= vaddr && vm->addr + vm->size > vaddr)
+   return svm;
+   }
+
+   return NULL;
+}
+
+void __init add_static_vm_early(struct static_vm *svm)
+{
+   struct static_vm *curr_svm;
+   struct vm_struct *vm;
+   void *vaddr;
+
+   vm = &svm->vm;
+   vm_area_add_early(vm);
+   vaddr = vm->addr;
+
+   list_for_each_entry(curr_svm, &static_vmlist, list) {
+   vm = &curr_svm->vm;
+
+   if (vm->addr > vaddr)
+   break;
+   }
+   list_add_tail(&svm->list, &curr_svm->list);
+}
+
 int ioremap_page(unsigned long virt, unsigned long phys,
 const struct mem_type *mtype)
 {
diff --git a/arch/arm/mm/mm.h b/arch/arm/mm/mm.h
index a8ee92d..d5a4e9a 100644
--- a/arch/arm/mm/mm.h
+++ b/arch/arm/mm/mm.h
@@ -1,4 +1,6 @@
 #ifdef CONFIG_MMU
+#include 
+#include 
 
 /* the upper-most page table pointer */
 extern pmd_t *top_pmd;
@@ -65,6 +67,16 @@ extern void __flush_dcache_page(struct address_space 
*mapping, struct page *page
 /* consistent regions used by dma_alloc_attrs() */
 #define VM_ARM_DMA_CONSISTENT  0x2000
 
+
+struct static_vm {
+   struct vm_struct vm;
+   struct list_head list;
+};
+
+extern struct list_head static_vmlist;
+extern struct static_vm *find_static_vm_vaddr(void *vaddr);
+extern __init void add_static_vm_early(struct static_vm *svm);
+
 #endif
 
 #ifdef CONFIG_ZONE_DMA
-- 
1.7.9.5

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

[PATCH v6 1/3] ARM: vmregion: remove vmregion code entirely

2013-02-06 Thread Joonsoo Kim

Now, there is no user for vmregion.
So remove it.

Acked-by: Nicolas Pitre 
Tested-by: Santosh Shilimkar 
Signed-off-by: Joonsoo Kim 

diff --git a/arch/arm/mm/Makefile b/arch/arm/mm/Makefile
index 8a9c4cb..4e333fa 100644
--- a/arch/arm/mm/Makefile
+++ b/arch/arm/mm/Makefile
@@ -6,7 +6,7 @@ obj-y   := dma-mapping.o extable.o 
fault.o init.o \
   iomap.o
 
 obj-$(CONFIG_MMU)  += fault-armv.o flush.o idmap.o ioremap.o \
-  mmap.o pgd.o mmu.o vmregion.o
+  mmap.o pgd.o mmu.o
 
 ifneq ($(CONFIG_MMU),y)
 obj-y  += nommu.o
diff --git a/arch/arm/mm/vmregion.c b/arch/arm/mm/vmregion.c
deleted file mode 100644
index a631016..000
diff --git a/arch/arm/mm/vmregion.h b/arch/arm/mm/vmregion.h
deleted file mode 100644
index 0f5a5f2..000
-- 
1.7.9.5

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

[PATCH v6 3/3] ARM: mm: use static_vm for managing static mapped areas

2013-02-06 Thread Joonsoo Kim

A static mapped area is ARM-specific, so it is better not to use
generic vmalloc data structure, that is, vmlist and vmlist_lock
for managing static mapped area. And it causes some needless overhead and
reducing this overhead is better idea.

Now, we have newly introduced static_vm infrastructure.
With it, we don't need to iterate all mapped areas. Instead, we just
iterate static mapped areas. It helps to reduce an overhead of finding
matched area. And architecture dependency on vmalloc layer is removed,
so it will help to maintainability for vmalloc layer.

Reviewed-by: Nicolas Pitre 
Acked-by: Rob Herring 
Tested-by: Santosh Shilimkar 
Signed-off-by: Joonsoo Kim 

diff --git a/arch/arm/mm/ioremap.c b/arch/arm/mm/ioremap.c
index 904c15e..04d9006 100644
--- a/arch/arm/mm/ioremap.c
+++ b/arch/arm/mm/ioremap.c
@@ -261,13 +261,14 @@ void __iomem * __arm_ioremap_pfn_caller(unsigned long pfn,
const struct mem_type *type;
int err;
unsigned long addr;
-   struct vm_struct * area;
+   struct vm_struct *area;
+   phys_addr_t paddr = __pfn_to_phys(pfn);
 
 #ifndef CONFIG_ARM_LPAE
/*
 * High mappings must be supersection aligned
 */
-   if (pfn >= 0x10 && (__pfn_to_phys(pfn) & ~SUPERSECTION_MASK))
+   if (pfn >= 0x10 && (paddr & ~SUPERSECTION_MASK))
return NULL;
 #endif
 
@@ -283,24 +284,16 @@ void __iomem * __arm_ioremap_pfn_caller(unsigned long pfn,
/*
 * Try to reuse one of the static mapping whenever possible.
 */
-   read_lock(&vmlist_lock);
-   for (area = vmlist; area; area = area->next) {
-   if (!size || (sizeof(phys_addr_t) == 4 && pfn >= 0x10))
-   break;
-   if (!(area->flags & VM_ARM_STATIC_MAPPING))
-   continue;
-   if ((area->flags & VM_ARM_MTYPE_MASK) != VM_ARM_MTYPE(mtype))
-   continue;
-   if (__phys_to_pfn(area->phys_addr) > pfn ||
-   __pfn_to_phys(pfn) + size-1 > area->phys_addr + 
area->size-1)
-   continue;
-   /* we can drop the lock here as we know *area is static */
-   read_unlock(&vmlist_lock);
-   addr = (unsigned long)area->addr;
-   addr += __pfn_to_phys(pfn) - area->phys_addr;
-   return (void __iomem *) (offset + addr);
+   if (size && !(sizeof(phys_addr_t) == 4 && pfn >= 0x10)) {
+   struct static_vm *svm;
+
+   svm = find_static_vm_paddr(paddr, size, mtype);
+   if (svm) {
+   addr = (unsigned long)svm->vm.addr;
+   addr += paddr - svm->vm.phys_addr;
+   return (void __iomem *) (offset + addr);
+   }
}
-   read_unlock(&vmlist_lock);
 
/*
 * Don't allow RAM to be mapped - this causes problems with ARMv6+
@@ -312,21 +305,21 @@ void __iomem * __arm_ioremap_pfn_caller(unsigned long pfn,
if (!area)
return NULL;
addr = (unsigned long)area->addr;
-   area->phys_addr = __pfn_to_phys(pfn);
+   area->phys_addr = paddr;
 
 #if !defined(CONFIG_SMP) && !defined(CONFIG_ARM_LPAE)
if (DOMAIN_IO == 0 &&
(((cpu_architecture() >= CPU_ARCH_ARMv6) && (get_cr() & CR_XP)) ||
   cpu_is_xsc3()) && pfn >= 0x10 &&
-  !((__pfn_to_phys(pfn) | size | addr) & ~SUPERSECTION_MASK)) {
+  !((paddr | size | addr) & ~SUPERSECTION_MASK)) {
area->flags |= VM_ARM_SECTION_MAPPING;
err = remap_area_supersections(addr, pfn, size, type);
-   } else if (!((__pfn_to_phys(pfn) | size | addr) & ~PMD_MASK)) {
+   } else if (!((paddr | size | addr) & ~PMD_MASK)) {
area->flags |= VM_ARM_SECTION_MAPPING;
err = remap_area_sections(addr, pfn, size, type);
} else
 #endif
-   err = ioremap_page_range(addr, addr + size, __pfn_to_phys(pfn),
+   err = ioremap_page_range(addr, addr + size, paddr,
 __pgprot(type->prot_pte));
 
if (err) {
@@ -410,34 +403,28 @@ __arm_ioremap_exec(unsigned long phys_addr, size_t size, 
bool cached)
 void __iounmap(volatile void __iomem *io_addr)
 {
void *addr = (void *)(PAGE_MASK & (unsigned long)io_addr);
-   struct vm_struct *vm;
+   struct static_vm *svm;
+
+   /* If this is a static mapping, we must leave it alone */
+   svm = find_static_vm_vaddr(addr);
+   if (svm)
+   return;
 
-   read_lock(&vmlist_lock);
-   for (vm = vmlist; vm; vm = vm->next) {
-   if (vm->addr > addr)
-

Re: [PATCH RESEND] ARM: sched: correct update_sched_clock()

2013-02-06 Thread JoonSoo Kim

Hello, Linus.

2013/2/6 Linus Walleij :
> On Wed, Feb 6, 2013 at 6:21 AM, Joonsoo Kim  wrote:
>
>> If we want load epoch_cyc and epoch_ns atomically,
>> we should update epoch_cyc_copy first of all.
>> This notify reader that updating is in progress.
>
> If you think the patch is finished, put it into Russell's patch tracker:
> http://www.arm.linux.org.uk/developer/patches/

Ah... Okay.
Thanks for notifying me.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH RESEND] ARM: sched: correct update_sched_clock()

2013-02-07 Thread Joonsoo Kim

Hello, Russell.

On Wed, Feb 06, 2013 at 04:33:55PM +, Russell King - ARM Linux wrote:
> On Wed, Feb 06, 2013 at 10:33:53AM +0100, Linus Walleij wrote:
> > On Wed, Feb 6, 2013 at 6:21 AM, Joonsoo Kim  wrote:
> > 
> > > If we want load epoch_cyc and epoch_ns atomically,
> > > we should update epoch_cyc_copy first of all.
> > > This notify reader that updating is in progress.
> > 
> > If you think the patch is finished, put it into Russell's patch tracker:
> > http://www.arm.linux.org.uk/developer/patches/
> 
> Yea, this patch looks like the right thing... and yes, into the patch
> system please.

I try to put it into patch tracker, but I fail to put it.
I use following command.

git send-email --to patc...@arm.linux.org.uk 0001-ARM-sched-correct~

Am I wrong?
Could you teach me how to put patch into your patch tracker?
I already read "help" on your website, but I can't find any stuff for
"git send-email".

Thanks.

> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH RESEND] ARM: sched: correct update_sched_clock()

2013-02-08 Thread JoonSoo Kim

2013/2/9 Nicolas Pitre :
> On Fri, 8 Feb 2013, Russell King - ARM Linux wrote:
>
>> On Fri, Feb 08, 2013 at 03:51:25PM +0900, Joonsoo Kim wrote:
>> > I try to put it into patch tracker, but I fail to put it.
>> > I use following command.
>> >
>> > git send-email --to patc...@arm.linux.org.uk 0001-ARM-sched-correct~
>> >
>> > Am I wrong?
>> > Could you teach me how to put patch into your patch tracker?
>> > I already read "help" on your website, but I can't find any stuff for
>> > "git send-email".
>>
>> I've never used git to send email, so I'm afraid I can't help with that.
>> Maybe someone with more experience with sending email from git can
>> comment?
>
> The commit text for the patch needs to contain the following 2 lines at
> the very bottom in order to match the patch system's expectations:
>
> PATCH FOLLOWS
> KernelVersion: v3.8
>
> The KernelVersion value needs to be adjusted of course.
>
> Note: I never tried it myself but it ought to work.

Yes It works!
Thanks!!!

> Nicolas
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH 4/5] sched: don't consider upper se in sched_slice()

2013-04-02 Thread Joonsoo Kim

Hello, Preeti.

On Tue, Apr 02, 2013 at 10:25:23AM +0530, Preeti U Murthy wrote:
> Hi Joonsoo,
> 
> On 04/02/2013 07:55 AM, Joonsoo Kim wrote:
> > Hello, Preeti.
> > 
> > On Mon, Apr 01, 2013 at 12:36:52PM +0530, Preeti U Murthy wrote:
> >> Hi Joonsoo,
> >>
> >> On 04/01/2013 09:38 AM, Joonsoo Kim wrote:
> >>> Hello, Preeti.
> >>>
> >>
> >>>>
> >>>> Ideally the children's cpu share must add upto the parent's share.
> >>>>
> >>>
> >>> I don't think so.
> >>>
> >>> We should schedule out the parent tg if 5ms is over. As we do so, we can
> >>> fairly distribute time slice to every tg within short term. If we add
> >>> the children's cpu share upto the parent's, the parent tg may have
> >>> large time slice, so it cannot be preempted easily. There may be a latency
> >>> problem if there are many tgs.
> >>
> >> In the case where the #running < sched_nr_latency, the children's
> >> sched_slices add up to the parent's.
> >>
> >> A rq with two tgs,each with 3 tasks.
> >>
> >> Each of these tasks have a sched slice of
> >> [(sysctl_sched_latency / 3) / 2] as of the present implementation.
> >>
> >> The sum of the above sched_slice of all tasks of a tg will lead to the
> >> sched_slice of its parent: sysctl_sched_latency / 2
> >>
> >> This breaks when the nr_running on each tg > sched_nr_latency. However I
> >> don't know if this is a good thing or a bad thing.
> > 
> > Ah.. Now I get your point. Yes, you are right and it may be good thing.
> > With that property, all tasks in the system can be scheduled at least once
> > in sysctl_sched_latency. sysctl_sched_latency is system-wide configuration,
> > so my patch may be wrong. With my patch, all tasks in the system cannot be
> > scheduled at least once in sysctl_sched_latency. Instead, it schedule
> > all tasks in cfs_rq at least once in sysctl_sched_latency if there is
> > no other tgs.
> 
> Exactly. You have got all the above points right.
> 
> > 
> > I think that it is real problem that sysctl_sched_min_granularity is not
> > guaranteed for each task.
> > Instead of this patch, how about considering low bound?
> > 
> > if (slice < sysctl_sched_min_granularity)
> > slice = sysctl_sched_min_granularity;
> 
> Consider the below scenario.
> 
> A runqueue has two task groups,each with 10 tasks.
> 
> With the current implementation,each of these tasks get a sched_slice of
> 2ms.Hence in a matter of (10*2) + (10*2) = 40 ms, all tasks( all tasks
> of both the task groups) will get the chance to run.
> 
> But what is the scheduling period in this scenario? Is it 40ms (extended
> sysctl_sched_latency), which is the scheduling period for each of the
> runqueues with 10 tasks in it?
> Or is it 80ms which is the total of the scheduling periods of each of
> the run queues with 10 tasks.Either way all tasks seem to get scheduled
> atleast once within the scheduling period.So we appear to be safe.
> Although the sched_slice < sched_min_granularity.
> 
> With your above lower bound of sysctl_sched_min_granularity, each task
> of each tg gets 4ms as its sched_slice.So in a matter of
> (10*4) + (10*4) = 80ms,all tasks get to run. With the above question
> being put forth here as well, we don't appear to be safe if the
> scheduling_period is considered to be 40ms, otherwise it appears fine to
> me, because it ensures the sched_slice is atleast sched_min_granularity
> no matter what.

So, you mean that we should guarantee to schedule each task atleast once
in sysctl_sched_latency? But this is not guaranteed in current code,
this is why I made this patch. Please refer a patch description.

Thanks.

> 
> Thank you
> 
> Regards
> Preeti U Murthy
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH 4/5] sched: don't consider upper se in sched_slice()

2013-04-02 Thread Joonsoo Kim

Hello, Mike.

On Tue, Apr 02, 2013 at 04:35:26AM +0200, Mike Galbraith wrote:
> On Tue, 2013-04-02 at 11:25 +0900, Joonsoo Kim wrote: 
> > Hello, Preeti.
> > 
> > On Mon, Apr 01, 2013 at 12:36:52PM +0530, Preeti U Murthy wrote:
> > > Hi Joonsoo,
> > > 
> > > On 04/01/2013 09:38 AM, Joonsoo Kim wrote:
> > > > Hello, Preeti.
> > > > 
> > > 
> > > >>
> > > >> Ideally the children's cpu share must add upto the parent's share.
> > > >>
> > > > 
> > > > I don't think so.
> > > > 
> > > > We should schedule out the parent tg if 5ms is over. As we do so, we can
> > > > fairly distribute time slice to every tg within short term. If we add
> > > > the children's cpu share upto the parent's, the parent tg may have
> > > > large time slice, so it cannot be preempted easily. There may be a 
> > > > latency
> > > > problem if there are many tgs.
> > > 
> > > In the case where the #running < sched_nr_latency, the children's
> > > sched_slices add up to the parent's.
> > > 
> > > A rq with two tgs,each with 3 tasks.
> > > 
> > > Each of these tasks have a sched slice of
> > > [(sysctl_sched_latency / 3) / 2] as of the present implementation.
> > > 
> > > The sum of the above sched_slice of all tasks of a tg will lead to the
> > > sched_slice of its parent: sysctl_sched_latency / 2
> > > 
> > > This breaks when the nr_running on each tg > sched_nr_latency. However I
> > > don't know if this is a good thing or a bad thing.
> > 
> > Ah.. Now I get your point. Yes, you are right and it may be good thing.
> > With that property, all tasks in the system can be scheduled at least once
> > in sysctl_sched_latency. sysctl_sched_latency is system-wide configuration,
> > so my patch may be wrong. With my patch, all tasks in the system cannot be
> > scheduled at least once in sysctl_sched_latency. Instead, it schedule
> > all tasks in cfs_rq at least once in sysctl_sched_latency if there is
> > no other tgs.
> > 
> > I think that it is real problem that sysctl_sched_min_granularity is not
> > guaranteed for each task.
> > Instead of this patch, how about considering low bound?
> > 
> > if (slice < sysctl_sched_min_granularity)
> > slice = sysctl_sched_min_granularity;
> 
> How many SCHED_IDLE or +nice tasks will fit in that?

It is more related to how many running tasks in cfs_rq and how many tg is
in the system. If we have two tgs which have more than sched_nr_latency
tasks, all these tasks fit in this condition in current implementation.

Thanks.

> 
> -Mike
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH 2/5] sched: factor out code to should_we_balance()

2013-04-02 Thread Joonsoo Kim

Hello, Peter.

On Tue, Apr 02, 2013 at 10:10:06AM +0200, Peter Zijlstra wrote:
> On Thu, 2013-03-28 at 16:58 +0900, Joonsoo Kim wrote:
> > Now checking that this cpu is appropriate to balance is embedded into
> > update_sg_lb_stats() and this checking has no direct relationship to
> > this
> > function.
> > 
> > There is not enough reason to place this checking at
> > update_sg_lb_stats(),
> > except saving one iteration for sched_group_cpus.
> 
> Its only one iteration if there's only 2 groups, but there can be more
> than 2, take any desktop Intel i7, it will have 4-8 cores, each with
> HT; thus the CPU domain will have 4-8 groups.
> 
> And note that local_group is always the first group of a domain, so
> we'd stop the balance at the first group and avoid touching the other
> 3-7, avoiding touching cachelines on 6-14 cpus.
> 
> So this short-cut does make sense.. its not pretty, granted, but
> killing it doesn't seem right.

It seems that there is some misunderstanding about this patch.
In this patch, we don't iterate all groups. Instead, we iterate on
cpus of local sched_group only. So there is no penalty you mentioned.

In summary, net effect is below.

* For cpus which are not proper to balance,
Reduce function call,
Reduce memset

* For cpus which should balance
Extra one iteration on cpus of local sched_group in should_we_balance()
Reduce some branch, so expect better optimization

Thanks.

> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH 4/5] sched: don't consider upper se in sched_slice()

2013-04-03 Thread Joonsoo Kim

Hello, Preeti.

On Tue, Apr 02, 2013 at 11:02:43PM +0530, Preeti U Murthy wrote:
> Hi Joonsoo,
> 
> 
> >>> I think that it is real problem that sysctl_sched_min_granularity is not
> >>> guaranteed for each task.
> >>> Instead of this patch, how about considering low bound?
> >>>
> >>> if (slice < sysctl_sched_min_granularity)
> >>>   slice = sysctl_sched_min_granularity;
> >>
> >> Consider the below scenario.
> >>
> >> A runqueue has two task groups,each with 10 tasks.
> >>
> >> With the current implementation,each of these tasks get a sched_slice of
> >> 2ms.Hence in a matter of (10*2) + (10*2) = 40 ms, all tasks( all tasks
> >> of both the task groups) will get the chance to run.
> >>
> >> But what is the scheduling period in this scenario? Is it 40ms (extended
> >> sysctl_sched_latency), which is the scheduling period for each of the
> >> runqueues with 10 tasks in it?
> >> Or is it 80ms which is the total of the scheduling periods of each of
> >> the run queues with 10 tasks.Either way all tasks seem to get scheduled
> >> atleast once within the scheduling period.So we appear to be safe.
> >> Although the sched_slice < sched_min_granularity.
> >>
> >> With your above lower bound of sysctl_sched_min_granularity, each task
> >> of each tg gets 4ms as its sched_slice.So in a matter of
> >> (10*4) + (10*4) = 80ms,all tasks get to run. With the above question
> >> being put forth here as well, we don't appear to be safe if the
> >> scheduling_period is considered to be 40ms, otherwise it appears fine to
> >> me, because it ensures the sched_slice is atleast sched_min_granularity
> >> no matter what.
> > 
> > So, you mean that we should guarantee to schedule each task atleast once
> > in sysctl_sched_latency? 
> 
> I would rather say all tasks get to run atleast once in a sched_period,
> which could be just sysctl_sched_latency or more depending on the number
> of tasks in the current implementation.
> 
> But this is not guaranteed in current code,
> > this is why I made this patch. Please refer a patch description.
> 
> Ok,take the example of a runqueue with 2 task groups,each with 10
> tasks.Same as your previous example. Can you explain how your patch
> ensures that all 20 tasks get to run atleast once in a sched_period?

My patch doesn't ensure that :)
I just want to say a problem of current implementation which can't
ensure to run atleast once in sched_period through my patch description.

So, how about extending a sched_period with rq->nr_running, instead of
cfs_rq->nr_running? It is my quick thought and I think that we can ensure
to run atleast once in this extending sched_period.

And, do we leave a problem if we cannot guaranteed atleast once property?

Thanks.

> 
> Regards
> Preeti U Murthy
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH 2/5] sched: factor out code to should_we_balance()

2013-04-03 Thread Joonsoo Kim

Hello, Peter.

On Tue, Apr 02, 2013 at 12:29:42PM +0200, Peter Zijlstra wrote:
> On Tue, 2013-04-02 at 12:00 +0200, Peter Zijlstra wrote:
> > On Tue, 2013-04-02 at 18:50 +0900, Joonsoo Kim wrote:
> > > 
> > > It seems that there is some misunderstanding about this patch.
> > > In this patch, we don't iterate all groups. Instead, we iterate on
> > > cpus of local sched_group only. So there is no penalty you mentioned.
> > 
> > OK, I'll go stare at it again..
> 
> Ah, I see, you're doing should_we_balance() _before_
> find_busiest_group() and instead you're doing another for_each_cpu() in
> there.
> 
> I'd write the thing like:
> 
> static bool should_we_balance(struct lb_env *env)
> {
>   struct sched_group *sg = env->sd->groups;
>   struct cpumask *sg_cpus, *sg_mask;
>   int cpu, balance_cpu = -1;
> 
>   if (env->idle == CPU_NEWLY_IDLE)
>   return true;
> 
>   sg_cpus = sched_group_cpus(sg);
>   sg_mask = sched_group_mask(sg);
> 
>   for_each_cpu_and(cpu, sg_cpus, env->cpus) {
>   if (!cpumask_test_cpu(cpu, sg_mask))
>   continue;
> 
>   if (!idle_cpu(cpu))
>   continue;
> 
>   balance_cpu = cpu;
>   break;
>   }
> 
>   if (balance_cpu == -1)
>   balance_cpu = group_balance_cpu(sg);
> 
>   return balance_cpu == env->dst_cpu;
> }

Okay. It looks nice.

> 
> I also considered doing the group_balance_cpu() first to avoid having
> to do the idle_cpu() scan, but that's a slight behavioural change
> afaict.

In my quick thought, we can avoid it through below way.

balance_cpu = group_balance_cpu(sg);
if (idle_cpu(balance_cpu))
return balance_cpu == env->dst_cpu;
else
do idle_cpus() scan loop

Is it your idea? If not, please let me know your idea.

Thanks.

> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [RT LATENCY] 249 microsecond latency caused by slub's unfreeze_partials() code.

2013-04-03 Thread Joonsoo Kim

Hello, Christoph.

On Tue, Apr 02, 2013 at 07:25:20PM +, Christoph Lameter wrote:
> On Tue, 2 Apr 2013, Joonsoo Kim wrote:
> 
> > We need one more fix for correctness.
> > When available is assigned by put_cpu_partial, it doesn't count cpu slab's 
> > objects.
> > Please reference my old patch.
> >
> > https://lkml.org/lkml/2013/1/21/64
> 
> Could you update your patch and submit it again?

Pekka alreay applied it.
Do we need update?

Thanks.

> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH v2 1/4] kprobes: delay blacklist symbol lookup until we actually need it

2013-04-04 Thread Joonsoo Kim

Hello, Oskar.

On Thu, Apr 04, 2013 at 02:51:26PM +0200, Oskar Andero wrote:
> From: Toby Collett 
> 
> The symbol lookup can take a long time and kprobes is
> initialised very early in boot, so delay symbol lookup
> until the blacklist is first used.
> 
> Cc: Masami Hiramatsu 
> Cc: David S. Miller 
> Reviewed-by: Radovan Lekanovic 
> Signed-off-by: Toby Collett 
> Signed-off-by: Oskar Andero 
> ---
>  kernel/kprobes.c | 98 
> ++--
>  1 file changed, 60 insertions(+), 38 deletions(-)
> 
> diff --git a/kernel/kprobes.c b/kernel/kprobes.c
> index e35be53..0a270e5 100644
> --- a/kernel/kprobes.c
> +++ b/kernel/kprobes.c
> @@ -68,6 +68,7 @@
>  #endif
>  
>  static int kprobes_initialized;
> +static int kprobe_blacklist_initialized;
>  static struct hlist_head kprobe_table[KPROBE_TABLE_SIZE];
>  static struct hlist_head kretprobe_inst_table[KPROBE_TABLE_SIZE];
>  
> @@ -102,6 +103,60 @@ static struct kprobe_blackpoint kprobe_blacklist[] = {
>   {NULL}/* Terminator */
>  };
>  
> +/* it can take some time ( > 100ms ) to initialise the
> + * blacklist so we delay this until we actually need it
> + */
> +static void init_kprobe_blacklist(void)
> +{
> + int i;
> + unsigned long offset = 0, size = 0;
> + char *modname, namebuf[128];
> + const char *symbol_name;
> + void *addr;
> + struct kprobe_blackpoint *kb;
> +
> + mutex_lock(&kprobe_mutex);
> + if (kprobe_blacklist_initialized)
> + goto out;
> +
> + /*
> +  * Lookup and populate the kprobe_blacklist.
> +  *
> +  * Unlike the kretprobe blacklist, we'll need to determine
> +  * the range of addresses that belong to the said functions,
> +  * since a kprobe need not necessarily be at the beginning
> +  * of a function.
> +  */
> + for (kb = kprobe_blacklist; kb->name != NULL; kb++) {
> + kprobe_lookup_name(kb->name, addr);
> + if (!addr)
> + continue;
> +
> + kb->start_addr = (unsigned long)addr;
> + symbol_name = kallsyms_lookup(kb->start_addr,
> + &size, &offset, &modname, namebuf);
> + if (!symbol_name)
> + kb->range = 0;
> + else
> + kb->range = size;
> + }
> +
> + if (kretprobe_blacklist_size) {
> + /* lookup the function address from its name */
> + for (i = 0; kretprobe_blacklist[i].name != NULL; i++) {
> + kprobe_lookup_name(kretprobe_blacklist[i].name,
> +kretprobe_blacklist[i].addr);
> + if (!kretprobe_blacklist[i].addr)
> + printk("kretprobe: lookup failed: %s\n",
> +kretprobe_blacklist[i].name);
> + }
> + }
> + kprobe_blacklist_initialized = 1;

You need smp_wmb() before assigning 'kprobe_blacklist_initialized = 1'.
This guarantee that who see kprobe_blacklist_initialized = 1 will get
updated data of kprobe_blacklist.

Please refer my previous patch once more :)

And How about define kprobe_blacklist_initialized as boolean?

Thanks.

> +
> +out:
> + mutex_unlock(&kprobe_mutex);
> +}
> +
>  #ifdef __ARCH_WANT_KPROBES_INSN_SLOT
>  /*
>   * kprobe->ainsn.insn points to the copy of the instruction to be
> @@ -1331,6 +1386,9 @@ static int __kprobes in_kprobes_functions(unsigned long 
> addr)
>   if (addr >= (unsigned long)__kprobes_text_start &&
>   addr < (unsigned long)__kprobes_text_end)
>   return -EINVAL;
> +
> + if (unlikely(!kprobe_blacklist_initialized))
> + init_kprobe_blacklist();
>   /*
>* If there exists a kprobe_blacklist, verify and
>* fail any probe registration in the prohibited area
> @@ -1816,6 +1874,8 @@ int __kprobes register_kretprobe(struct kretprobe *rp)
>   void *addr;
>  
>   if (kretprobe_blacklist_size) {
> + if (unlikely(!kprobe_blacklist_initialized))
> + init_kprobe_blacklist();
>   addr = kprobe_addr(&rp->kp);
>   if (IS_ERR(addr))
>   return PTR_ERR(addr);
> @@ -2065,11 +2125,6 @@ static struct notifier_block kprobe_module_nb = {
>  static int __init init_kprobes(void)
>  {
>   int i, err = 0;
> - unsigned long offset = 0, size = 0;
> - char *modname, namebuf[128];
> - const char *symbol_name;
> - void *addr;
> - struct kprobe_blackpoint *kb;
>  
>   /* FIXME allocate the probe table, currently defined statically */
>   /* initialize all list heads */
> @@ -2079,39 +2134,6 @@ static int __init init_kprobes(void)
>   raw_spin_lock_init(&(kretprobe_table_locks[i].lock));
>   }
>  
> - /*
> -  * Lookup and populate the kprobe_blacklist.
> -  *
> -  * Unlike the kretprobe blacklist, we'll need to determine
> -  * the range of addre

Re: [RT LATENCY] 249 microsecond latency caused by slub's unfreeze_partials() code.

2013-04-04 Thread Joonsoo Kim

On Thu, Apr 04, 2013 at 01:53:25PM +, Christoph Lameter wrote:
> On Thu, 4 Apr 2013, Joonsoo Kim wrote:
> 
> > Pekka alreay applied it.
> > Do we need update?
> 
> Well I thought the passing of the count via lru.next would be something
> worthwhile to pick up.
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/

Hello, Pekka.

Here goes a patch implementing Christoph's idea.
Instead of updating my previous patch, I re-write this patch on top of
your slab/next tree.

Thanks.

8<---
>From e1c18793dd2a9d9cef87b07faf975364b71276d7 Mon Sep 17 00:00:00 2001
From: Joonsoo Kim 
Date: Fri, 5 Apr 2013 10:49:36 +0900
Subject: [PATCH] slub: use page->lru.next to calculate nr of acquired object

We can pass inuse count via page->lru.next in order to calculate number
of acquired objects and it is more beautiful way. This reduces one
function argument and makes clean code.

Cc: Christoph Lameter 
Suggested-by: Christoph Lameter 
Signed-off-by: Joonsoo Kim 

diff --git a/mm/slub.c b/mm/slub.c
index 21b3f00..8a35464 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -1493,11 +1493,12 @@ static inline void remove_partial(struct 
kmem_cache_node *n,
  */
 static inline void *acquire_slab(struct kmem_cache *s,
struct kmem_cache_node *n, struct page *page,
-   int mode, int *objects)
+   int mode)
 {
void *freelist;
unsigned long counters;
struct page new;
+   unsigned long inuse;
 
/*
 * Zap the freelist and set the frozen bit.
@@ -1507,7 +1508,7 @@ static inline void *acquire_slab(struct kmem_cache *s,
freelist = page->freelist;
counters = page->counters;
new.counters = counters;
-   *objects = new.objects - new.inuse;
+   inuse = page->inuse;
if (mode) {
new.inuse = page->objects;
new.freelist = NULL;
@@ -1525,6 +1526,7 @@ static inline void *acquire_slab(struct kmem_cache *s,
return NULL;
 
remove_partial(n, page);
+   page->lru.next = (void *)inuse;
WARN_ON(!freelist);
return freelist;
 }
@@ -1541,7 +1543,6 @@ static void *get_partial_node(struct kmem_cache *s, 
struct kmem_cache_node *n,
struct page *page, *page2;
void *object = NULL;
int available = 0;
-   int objects;
 
/*
 * Racy check. If we mistakenly see no partial slabs then we
@@ -1559,11 +1560,11 @@ static void *get_partial_node(struct kmem_cache *s, 
struct kmem_cache_node *n,
if (!pfmemalloc_match(page, flags))
continue;
 
-   t = acquire_slab(s, n, page, object == NULL, &objects);
+   t = acquire_slab(s, n, page, object == NULL);
if (!t)
break;
 
-   available += objects;
+   available += (page->objects - (unsigned long)page->lru.next);
if (!object) {
c->page = page;
stat(s, ALLOC_FROM_PARTIAL);
-- 
1.7.9.5

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH 4/5] sched: don't consider upper se in sched_slice()

2013-04-04 Thread Joonsoo Kim

Hello, Preeti.

On Thu, Apr 04, 2013 at 12:18:32PM +0530, Preeti U Murthy wrote:
> Hi Joonsoo,
> 
> On 04/04/2013 06:12 AM, Joonsoo Kim wrote:
> > Hello, Preeti.
> 
> > 
> > So, how about extending a sched_period with rq->nr_running, instead of
> > cfs_rq->nr_running? It is my quick thought and I think that we can ensure
> > to run atleast once in this extending sched_period.
> 
> Yeah this seems to be correct.This would ensure sched_min_granularity
> also. So then in the scenarion where there are 2 tgs in a runqueue with
> 10 tasks each,when we calculate the sched_slice of any task,the
> __sched_period() would return 4*20 = 80ms.
> 
> The sched_slice of each of the task would be 80/20 = 4ms. But what about
> the sched_slice of each task group? How would that be calculated then?

Ah... Okay.
I will think more deeply about this issue.

> 
> Let us take the above example and walk through this problem.This would
> probably help us spot the issues involved with this.
> 
> > And, do we leave a problem if we cannot guaranteed atleast once property?
> 
> This would depend on the results of the benchmarks with the changes.I am
> unable to comment on this off the top of my head.

Okay. :)

Thanks for your kind review!!

> 
> Thank you
> 
> Regards
> Preeti U Murthy
> 
> 
> 
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH] x86, reboot: skip reboot_fixups in early boot phase

2013-01-28 Thread Joonsoo Kim

On Thu, Jan 24, 2013 at 08:52:32PM -0800, Greg Kroah-Hartman wrote:
> On Thu, Jan 24, 2013 at 09:21:52PM -0700, Bjorn Helgaas wrote:
> > On Thu, Jan 24, 2013 at 9:14 PM, Greg Kroah-Hartman
> >  wrote:
> > > On Thu, Jan 24, 2013 at 07:59:01PM -0700, Bjorn Helgaas wrote:
> > >> [+cc Greg for driver core]
> > >>
> > >> On Fri, Jan 25, 2013 at 10:13:03AM +0900, Joonsoo Kim wrote:
> > >> > Hello, Bjorn.
> > >> >
> > >> > On Thu, Jan 24, 2013 at 10:45:13AM -0700, Bjorn Helgaas wrote:
> > >> > > On Fri, Dec 28, 2012 at 6:50 AM, Joonsoo Kim  
> > >> > > wrote:
> > >> > > > During early boot phase, PCI bus subsystem is not yet initialized.
> > >> > > > If panic is occured in early boot phase and panic_timeout is set,
> > >> > > > code flow go into emergency_restart() and hit 
> > >> > > > mach_reboot_fixups(), then
> > >> > > > encounter another panic. When second panic, we can't hold a 
> > >> > > > panic_lock, so
> > >> > > > code flow go into panic_smp_self_stop() which prevent system to 
> > >> > > > restart.
> > >> > > >
> > >> > > > For avoid second panic, skip reboot_fixups in early boot phase.
> > >> > > > It makes panic_timeout works in early boot phase.
> > >> > > >
> > >> > > > Cc: Thomas Gleixner 
> > >> > > > Cc: Ingo Molnar 
> > >> > > > Cc: "H. Peter Anvin" 
> > >> > > > Signed-off-by: Joonsoo Kim 
> > >> > > >
> > >> > > > diff --git a/arch/x86/kernel/reboot_fixups_32.c 
> > >> > > > b/arch/x86/kernel/reboot_fixups_32.c
> > >> > > > index c8e41e9..b9b8ec9 100644
> > >> > > > --- a/arch/x86/kernel/reboot_fixups_32.c
> > >> > > > +++ b/arch/x86/kernel/reboot_fixups_32.c
> > >> > > > @@ -89,6 +89,10 @@ void mach_reboot_fixups(void)
> > >> > > > if (in_interrupt())
> > >> > > > return;
> > >> > > >
> > >> > > > +   /* During early boot phase, PCI is not yet initialized */
> > >> > > > +   if (system_state == SYSTEM_BOOTING)
> > >> > > > +   return;
> > >> > > > +
> > >> > > > for (i=0; i < ARRAY_SIZE(fixups_table); i++) {
> > >> > > > cur = &(fixups_table[i]);
> > >> > > > dev = pci_get_device(cur->vendor, cur->device, 
> > >> > > > NULL);
> > >> > >
> > >> > > I guess you're saying that if we call pci_get_device() too early, it
> > >> > > panics?  Did you figure out why that happens?
> > >> > >
> > >> > > If we call pci_get_device() before PCI has been initialized, it would
> > >> > > be good if it just returned NULL, indicating that we didn't find any
> > >> > > matching device.  I looked briefly, and I thought that's what would
> > >> > > happen, but apparently I'm missing something.
> > >> >
> > >> > In bus_find_device(), klist_iter_init_node() is called with
> > >> > @bus->p->klist_device. Before initialization, bus->p is NULL,
> > >> > so panic is occured.
> > >>
> > >> I see.  pci_bus_type.p is initialized by __bus_register() in this path:
> > >>
> > >>   pci_driver_init# postcore_initcall
> > >> bus_register(&pci_bus_type)
> > >>   __bus_register
> > >> priv = kzalloc(sizeof(struct subsys_private))
> > >>   bus->p = priv
> > >>   klist_init(&priv->klist_devices, klist_devices_get, 
> > >> klist_devices_put)
> > >>
> > >> I was hoping we could statically initialize the klist, but that doesn't
> > >> seem likely.
> > >>
> > >> But I wonder if we could do something like the following.  If we could,
> > >> then callers wouldn't have to worry about whether or not the bus has been
> > >> initialized.
> > >
> > > 
> > >
> > > I have no objection to that patch, but really, someone wants to call
> > > pci_find_device() befor

Re: [PATCH] slub: assign refcount for kmalloc_caches

2013-01-28 Thread Joonsoo Kim

On Thu, Jan 24, 2013 at 10:32:32PM -0500, CAI Qian wrote:
> 
> 
> - Original Message -
> > From: "Greg Kroah-Hartman" 
> > To: "Joonsoo Kim" 
> > Cc: "Paul Hargrove" , "Pekka Enberg" 
> > , linux-kernel@vger.kernel.org,
> > linux...@kvack.org, "Christoph Lameter" 
> > Sent: Tuesday, January 15, 2013 3:23:36 AM
> > Subject: Re: [PATCH] slub: assign refcount for kmalloc_caches
> > 
> > On Fri, Jan 11, 2013 at 04:52:54PM +0900, Joonsoo Kim wrote:
> > > On Thu, Jan 10, 2013 at 08:47:39PM -0800, Paul Hargrove wrote:
> > > > I just had a look at patch-3.7.2-rc1, and this change doesn't
> > > > appear to
> > > > have made it in yet.
> > > > Am I missing something?
> > > > 
> > > > -Paul
> > > 
> > > I try to check it.
> > > Ccing to Greg.
> > > 
> > > Hello, Pekka and Greg.
> > > 
> > > v3.8-rcX has already fixed by another stuff, but it is not simple
> > > change.
> > > So I made a new patch and sent it.
> > > 
> > > How this kind of patch (only for stable v3.7) go into stable tree?
> > > through Pekka's slab tree? or send it to Greg, directly?
> > > 
> > > I don't know how to submit this kind of patch to stable tree
> > > exactly.
> > > Could anyone help me?
> > 
> > Please redo it, and send it to sta...@vger.kernel.org, and say
> > exactly
> > why it isn't in Linus's tree, and that it should only be applied to
> > 3.7-stable.
> I also met this during the testing, so I'll re-send it then.

Hello, CAI Qian.

I totally forget this.
Thanks for this work.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH v3 0/3] introduce static_vm for ARM-specific static mapped area

2013-01-28 Thread Joonsoo Kim

On Mon, Jan 28, 2013 at 01:04:24PM -0500, Nicolas Pitre wrote:
> On Mon, 28 Jan 2013, Will Deacon wrote:
> 
> > Hello,
> > 
> > On Thu, Jan 24, 2013 at 01:28:51AM +, Joonsoo Kim wrote:
> > > In current implementation, we used ARM-specific flag, that is,
> > > VM_ARM_STATIC_MAPPING, for distinguishing ARM specific static mapped area.
> > > The purpose of static mapped area is to re-use static mapped area when
> > > entire physical address range of the ioremap request can be covered
> > > by this area.
> > > 
> > > This implementation causes needless overhead for some cases.
> > > For example, assume that there is only one static mapped area and
> > > vmlist has 300 areas. Every time we call ioremap, we check 300 areas for
> > > deciding whether it is matched or not. Moreover, even if there is
> > > no static mapped area and vmlist has 300 areas, every time we call
> > > ioremap, we check 300 areas in now.
> > > 
> > > If we construct a extra list for static mapped area, we can eliminate
> > > above mentioned overhead.
> > > With a extra list, if there is one static mapped area,
> > > we just check only one area and proceed next operation quickly.
> > > 
> > > In fact, it is not a critical problem, because ioremap is not frequently
> > > used. But reducing overhead is better idea.
> > > 
> > > Another reason for doing this work is for removing vm_struct list 
> > > management,
> > > entirely. For more information, look at the following link.
> > > http://lkml.org/lkml/2012/12/6/184
> > 
> > First patch looks good (removing the unused vmregion stuff) but I'm not so
> > sure about the rest of it. If you really care about ioremap performance,
> > perhaps it would be better to have a container struct around the vm_struct
> > for static mappings and then stick them in an augmented rbtree so you can
> > efficiently find the mapping encompassing a particular physical address?
> 
> How can ioremap performance be a problem is the question I had since the 
> beginning.
> 
> Firstly, ioremap is _not_ meant to be used in performance critical 
> paths.
> 
> Secondly, there shouldn't be _that_ many entries on the vmlist such as 
> 300.  That sounds a bit excessive.
> 
> So please, can we discuss the reasons that motivated those patches in 
> the first place?  Maybe that's where the actual problem is.

Hello, Wiil and Nicolas.
First of all, thanks for reviewing.

There is another reason for doing this work.
As mentioned above, I try to remove list management for vm_struct(vmlist),
entirely. For that purpose, removing architecture dependency against vmlist
is needed. Below link is for my RFC patch trying to remove list management
for vm_struct.

http://lkml.org/lkml/2012/12/6/184

Removing dependency for other architectures is rather trivial, but for ARM,
it is not trivial case. So I prepared this patchset.
My description makes you missleading possibly.
Sorry for this.

Answer for your other questions is below.

I know ioremap is _not_ meant to be used in performance critical paths, and
I mentioned it earlier.
"In fact, it is not a critical problem, because ioremap is not frequently used.
But reducing overhead is better idea."

And, there are many entries on the vmlist for my test devices(Android phone).
I saw more than 300 entries in former days, but today, I re-check it and
find 230~250 entries.

Thanks.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH v3 2/3] ARM: static_vm: introduce an infrastructure for static mapped area

2013-01-29 Thread Joonsoo Kim

Hello, Nicolas.

On Tue, Jan 29, 2013 at 07:05:32PM -0500, Nicolas Pitre wrote:
> On Thu, 24 Jan 2013, Joonsoo Kim wrote:
> 
> > From: Joonsoo Kim 
> > 
> > In current implementation, we used ARM-specific flag, that is,
> > VM_ARM_STATIC_MAPPING, for distinguishing ARM specific static mapped area.
> > The purpose of static mapped area is to re-use static mapped area when
> > entire physical address range of the ioremap request can be covered
> > by this area.
> > 
> > This implementation causes needless overhead for some cases.
> > For example, assume that there is only one static mapped area and
> > vmlist has 300 areas. Every time we call ioremap, we check 300 areas for
> > deciding whether it is matched or not. Moreover, even if there is
> > no static mapped area and vmlist has 300 areas, every time we call
> > ioremap, we check 300 areas in now.
> > 
> > If we construct a extra list for static mapped area, we can eliminate
> > above mentioned overhead.
> > With a extra list, if there is one static mapped area,
> > we just check only one area and proceed next operation quickly.
> > 
> > In fact, it is not a critical problem, because ioremap is not frequently
> > used. But reducing overhead is better idea.
> > 
> > Another reason for doing this work is for removing architecture dependency
> > on vmalloc layer. I think that vmlist and vmlist_lock is internal data
> > structure for vmalloc layer. Some codes for debugging and stat inevitably
> > use vmlist and vmlist_lock. But it is preferable that they are used
> > as least as possible in outside of vmalloc.c
> > 
> > Now, I introduce an ARM-specific infrastructure for static mapped area. In
> > the following patch, we will use this and resolve above mentioned problem.
> > 
> > Signed-off-by: Joonsoo Kim 
> > Signed-off-by: Joonsoo Kim 
> 
> First of all, I don't think you really need a new file with a global 
> scope header file.  Given that this code is meant to be used only for 
> ioremap optimization  on ARM, it is probably a better idea to simply put 
> it all into arch/arm/mm/ioremap.c instead.  The only function that needs 
> to be exported out of ioremap.c is insert_static_vm(), and only for the 
> benefit of arch/arm/mm/mmu.c, therefore this function prototype may as 
> well just be added to arch/arm/mm/mm.h.

I agree with your all opinions.
I will re-work and will re-send v4 as soon as possible.
Thanks for review.

> More comments below.
> 
> > diff --git a/arch/arm/include/asm/mach/static_vm.h 
> > b/arch/arm/include/asm/mach/static_vm.h
> > new file mode 100644
> > index 000..72c8339
> > --- /dev/null
> > +++ b/arch/arm/include/asm/mach/static_vm.h
> > @@ -0,0 +1,45 @@
> > +/*
> > + * arch/arm/include/asm/mach/static_vm.h
> > + *
> > + * Copyright (C) 2012 LG Electronics, Joonsoo Kim 
> > + *
> > + * This program is free software; you can redistribute it and/or modify
> > + * it under the terms of the GNU General Public License version 2 as
> > + * published by the Free Software Foundation.
> > + *
> > + * This program is distributed in the hope that it will be useful,
> > + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> > + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> > + * GNU General Public License for more details.
> > + *
> > + * You should have received a copy of the GNU General Public License
> > + * along with this program; if not, write to the Free Software
> > + * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
> > + */
> > +
> > +#ifndef _ASM_MACH_STATIC_VM_H
> > +#define _ASM_MACH_STATIC_VM_H
> > +
> > +#include 
> > +#include 
> > +
> > +struct static_vm {
> > +   struct static_vm*next;
> > +   void*vaddr;
> > +   unsigned long   size;
> > +   unsigned long   flags;
> > +   phys_addr_t paddr;
> > +   const void  *caller;
> > +};
> 
> Here you're duplicating most of the vm_struct content for no obvious 
> reasons.  Patch #3 even allocates both a vm_struct and a static_vm 
> instance in parallel for each mapping.  Instead, you should consider 
> something like this:
> 
> struct static_vm {
>   struct static_vm *next;
>   struct vm_struct vm;
> };
> 
> This way, you only need to allocate one structure:
> 
>   struct static_vm *svm = early_alloc(...);
>   ...
>   svm->vm.addr = addr;
>   ...
>   vm_area_add_early(&svm->vm

[PATCH v4 1/3] ARM: vmregion: remove vmregion code entirely

2013-01-30 Thread Joonsoo Kim

Now, there is no user for vmregion.
So remove it.

Acked-by: Nicolas Pitre 
Signed-off-by: Joonsoo Kim 

diff --git a/arch/arm/mm/Makefile b/arch/arm/mm/Makefile
index 8a9c4cb..4e333fa 100644
--- a/arch/arm/mm/Makefile
+++ b/arch/arm/mm/Makefile
@@ -6,7 +6,7 @@ obj-y   := dma-mapping.o extable.o 
fault.o init.o \
   iomap.o
 
 obj-$(CONFIG_MMU)  += fault-armv.o flush.o idmap.o ioremap.o \
-  mmap.o pgd.o mmu.o vmregion.o
+  mmap.o pgd.o mmu.o
 
 ifneq ($(CONFIG_MMU),y)
 obj-y  += nommu.o
diff --git a/arch/arm/mm/vmregion.c b/arch/arm/mm/vmregion.c
deleted file mode 100644
index a631016..000
--- a/arch/arm/mm/vmregion.c
+++ /dev/null
@@ -1,205 +0,0 @@
-#include 
-#include 
-#include 
-#include 
-#include 
-#include 
-
-#include "vmregion.h"
-
-/*
- * VM region handling support.
- *
- * This should become something generic, handling VM region allocations for
- * vmalloc and similar (ioremap, module space, etc).
- *
- * I envisage vmalloc()'s supporting vm_struct becoming:
- *
- *  struct vm_struct {
- *struct vmregion  region;
- *unsigned longflags;
- *struct page  **pages;
- *unsigned int nr_pages;
- *unsigned longphys_addr;
- *  };
- *
- * get_vm_area() would then call vmregion_alloc with an appropriate
- * struct vmregion head (eg):
- *
- *  struct vmregion vmalloc_head = {
- * .vm_list= LIST_HEAD_INIT(vmalloc_head.vm_list),
- * .vm_start   = VMALLOC_START,
- * .vm_end = VMALLOC_END,
- *  };
- *
- * However, vmalloc_head.vm_start is variable (typically, it is dependent on
- * the amount of RAM found at boot time.)  I would imagine that get_vm_area()
- * would have to initialise this each time prior to calling vmregion_alloc().
- */
-
-struct arm_vmregion *
-arm_vmregion_alloc(struct arm_vmregion_head *head, size_t align,
-  size_t size, gfp_t gfp, const void *caller)
-{
-   unsigned long start = head->vm_start, addr = head->vm_end;
-   unsigned long flags;
-   struct arm_vmregion *c, *new;
-
-   if (head->vm_end - head->vm_start < size) {
-   printk(KERN_WARNING "%s: allocation too big (requested %#x)\n",
-   __func__, size);
-   goto out;
-   }
-
-   new = kmalloc(sizeof(struct arm_vmregion), gfp);
-   if (!new)
-   goto out;
-
-   new->caller = caller;
-
-   spin_lock_irqsave(&head->vm_lock, flags);
-
-   addr = rounddown(addr - size, align);
-   list_for_each_entry_reverse(c, &head->vm_list, vm_list) {
-   if (addr >= c->vm_end)
-   goto found;
-   addr = rounddown(c->vm_start - size, align);
-   if (addr < start)
-   goto nospc;
-   }
-
- found:
-   /*
-* Insert this entry after the one we found.
-*/
-   list_add(&new->vm_list, &c->vm_list);
-   new->vm_start = addr;
-   new->vm_end = addr + size;
-   new->vm_active = 1;
-
-   spin_unlock_irqrestore(&head->vm_lock, flags);
-   return new;
-
- nospc:
-   spin_unlock_irqrestore(&head->vm_lock, flags);
-   kfree(new);
- out:
-   return NULL;
-}
-
-static struct arm_vmregion *__arm_vmregion_find(struct arm_vmregion_head 
*head, unsigned long addr)
-{
-   struct arm_vmregion *c;
-
-   list_for_each_entry(c, &head->vm_list, vm_list) {
-   if (c->vm_active && c->vm_start == addr)
-   goto out;
-   }
-   c = NULL;
- out:
-   return c;
-}
-
-struct arm_vmregion *arm_vmregion_find(struct arm_vmregion_head *head, 
unsigned long addr)
-{
-   struct arm_vmregion *c;
-   unsigned long flags;
-
-   spin_lock_irqsave(&head->vm_lock, flags);
-   c = __arm_vmregion_find(head, addr);
-   spin_unlock_irqrestore(&head->vm_lock, flags);
-   return c;
-}
-
-struct arm_vmregion *arm_vmregion_find_remove(struct arm_vmregion_head *head, 
unsigned long addr)
-{
-   struct arm_vmregion *c;
-   unsigned long flags;
-
-   spin_lock_irqsave(&head->vm_lock, flags);
-   c = __arm_vmregion_find(head, addr);
-   if (c)
-   c->vm_active = 0;
-   spin_unlock_irqrestore(&head->vm_lock, flags);
-   return c;
-}
-
-void arm_vmregion_free(struct arm_vmregion_head *head, struct arm_vmregion *c)
-{
-   unsigned long flags;
-
-   spin_lock_irqsave(&head->vm_lock, flags);
-   list_del(&c->vm_list);
-   spin_unlock_irqrestore(&head->vm_lock, flags);
-
-   kfree(c);
-}
-
-#ifdef CONFIG_PROC_FS
-static int arm_vmregion_show(struct seq_file *m, void *p)
-{
-   struct arm_vmregion *c = list_entry(p, struct arm_vmre

[PATCH v4 2/3] ARM: ioremap: introduce an infrastructure for static mapped area

2013-01-30 Thread Joonsoo Kim

In current implementation, we used ARM-specific flag, that is,
VM_ARM_STATIC_MAPPING, for distinguishing ARM specific static mapped area.
The purpose of static mapped area is to re-use static mapped area when
entire physical address range of the ioremap request can be covered
by this area.

This implementation causes needless overhead for some cases.
For example, assume that there is only one static mapped area and
vmlist has 300 areas. Every time we call ioremap, we check 300 areas for
deciding whether it is matched or not. Moreover, even if there is
no static mapped area and vmlist has 300 areas, every time we call
ioremap, we check 300 areas in now.

If we construct a extra list for static mapped area, we can eliminate
above mentioned overhead.
With a extra list, if there is one static mapped area,
we just check only one area and proceed next operation quickly.

In fact, it is not a critical problem, because ioremap is not frequently
used. But reducing overhead is better idea.

Another reason for doing this work is for removing architecture dependency
on vmalloc layer. I think that vmlist and vmlist_lock is internal data
structure for vmalloc layer. Some codes for debugging and stat inevitably
use vmlist and vmlist_lock. But it is preferable that they are used
as least as possible in outside of vmalloc.c

Now, I introduce an ARM-specific infrastructure for static mapped area. In
the following patch, we will use this and resolve above mentioned problem.

Signed-off-by: Joonsoo Kim 

diff --git a/arch/arm/mm/ioremap.c b/arch/arm/mm/ioremap.c
index 88fd86c..ceb34ae 100644
--- a/arch/arm/mm/ioremap.c
+++ b/arch/arm/mm/ioremap.c
@@ -39,6 +39,78 @@
 #include 
 #include "mm.h"
 
+
+LIST_HEAD(static_vmlist);
+static DEFINE_RWLOCK(static_vmlist_lock);
+
+static struct static_vm *find_static_vm_paddr(phys_addr_t paddr,
+   size_t size, unsigned long flags)
+{
+   struct static_vm *svm;
+   struct vm_struct *vm;
+
+   read_lock(&static_vmlist_lock);
+   list_for_each_entry(svm, &static_vmlist, list) {
+   if (svm->flags != flags)
+   continue;
+
+   vm = &svm->vm;
+   if (vm->phys_addr > paddr ||
+   paddr + size - 1 > vm->phys_addr + vm->size - 1)
+   continue;
+
+   read_unlock(&static_vmlist_lock);
+   return svm;
+   }
+
+   return NULL;
+}
+
+struct static_vm *find_static_vm_vaddr(void *vaddr)
+{
+   struct static_vm *svm;
+   struct vm_struct *vm;
+
+   read_lock(&static_vmlist_lock);
+   list_for_each_entry(svm, &static_vmlist, list) {
+   vm = &svm->vm;
+
+   /* static_vmlist is ascending order */
+   if (vm->addr > vaddr)
+   break;
+
+   if (vm->addr <= vaddr && vm->addr + vm->size > vaddr) {
+   read_unlock(&static_vmlist_lock);
+   return svm;
+   }
+   }
+   read_unlock(&static_vmlist_lock);
+
+   return NULL;
+}
+
+void add_static_vm_early(struct static_vm *svm, unsigned long flags)
+{
+   struct static_vm *curr_svm;
+   struct vm_struct *vm;
+   void *vaddr;
+
+   vm_area_add_early(&svm->vm);
+
+   vaddr = svm->vm.addr;
+   svm->flags = flags;
+
+   write_lock(&static_vmlist_lock);
+   list_for_each_entry(curr_svm, &static_vmlist, list) {
+   vm = &curr_svm->vm;
+
+   if (vm->addr > vaddr)
+   break;
+   }
+   list_add_tail(&svm->list, &curr_svm->list);
+   write_unlock(&static_vmlist_lock);
+}
+
 int ioremap_page(unsigned long virt, unsigned long phys,
 const struct mem_type *mtype)
 {
diff --git a/arch/arm/mm/mm.h b/arch/arm/mm/mm.h
index a8ee92d..fb45c79 100644
--- a/arch/arm/mm/mm.h
+++ b/arch/arm/mm/mm.h
@@ -1,4 +1,6 @@
 #ifdef CONFIG_MMU
+#include 
+#include 
 
 /* the upper-most page table pointer */
 extern pmd_t *top_pmd;
@@ -65,6 +67,24 @@ extern void __flush_dcache_page(struct address_space 
*mapping, struct page *page
 /* consistent regions used by dma_alloc_attrs() */
 #define VM_ARM_DMA_CONSISTENT  0x2000
 
+
+/* ARM specific static_vm->flags bits */
+#define STATIC_VM_MEM  0x0001
+#define STATIC_VM_EMPTY0x0002
+#define STATIC_VM_MTYPE(mtype) ((mtype) << 20)
+
+#define STATIC_VM_TYPE(type, mtype) (type | STATIC_VM_MTYPE(mtype))
+
+struct static_vm {
+   struct vm_struct vm;
+   struct list_head list;
+   unsigned long flags;
+};
+
+extern struct list_head static_vmlist;
+extern struct static_vm *find_static_vm_vaddr(void *vaddr);
+extern void add_static_vm_early(struct static_vm *svm, unsigned long flags);
+
 #endif
 
 #ifdef CONFIG_ZONE_DMA
-- 
1.7.9.5

--
To unsubscri

[PATCH v4 0/3] introduce static_vm for ARM-specific static mapped area

2013-01-30 Thread Joonsoo Kim

In current implementation, we used ARM-specific flag, that is,
VM_ARM_STATIC_MAPPING, for distinguishing ARM specific static mapped area.
The purpose of static mapped area is to re-use static mapped area when
entire physical address range of the ioremap request can be covered
by this area.

This implementation causes needless overhead for some cases.
For example, assume that there is only one static mapped area and
vmlist has 300 areas. Every time we call ioremap, we check 300 areas for
deciding whether it is matched or not. Moreover, even if there is
no static mapped area and vmlist has 300 areas, every time we call
ioremap, we check 300 areas in now.

If we construct a extra list for static mapped area, we can eliminate
above mentioned overhead.
With a extra list, if there is one static mapped area,
we just check only one area and proceed next operation quickly.

In fact, it is not a critical problem, because ioremap is not frequently
used. But reducing overhead is better idea.

Another reason for doing this work is for removing vm_struct list management,
entirely. For more information, look at the following link.
http://lkml.org/lkml/2012/12/6/184

Changelog
v3->v4:
  [2/3]: Changes from Nicolas' suggestion
 - embed static_vm code in ioremap.c
 - simplify struct static_vm
 - remove init_static_vm, instead, add_static_vm_early() init static_vm
 Use generic list for list management of static_vm
 Convert spin_lock to rw_lock
 Modify static_vm's flags bits
  [3/3]: Rework according to [2/3] change
  Rebased on v3.8-rc5

v2->v3:
  coverletter: refer a link related to this work
  [2/3]: drop @flags of find_static_vm_vaddr
  Rebased on v3.8-rc4

v1->v2: 
  [2/3]: patch description is improved.
  Rebased on v3.7-rc7

Joonsoo Kim (3):
  ARM: vmregion: remove vmregion code entirely
  ARM: ioremap: introduce an infrastructure for static mapped area
  ARM: mm: use static_vm for managing static mapped areas

 arch/arm/mm/Makefile   |2 +-
 arch/arm/mm/ioremap.c  |  147 +++---
 arch/arm/mm/mm.h   |   28 ---
 arch/arm/mm/mmu.c  |   47 ++-
 arch/arm/mm/vmregion.c |  205 
 arch/arm/mm/vmregion.h |   31 
 6 files changed, 147 insertions(+), 313 deletions(-)
 delete mode 100644 arch/arm/mm/vmregion.c
 delete mode 100644 arch/arm/mm/vmregion.h

-- 
1.7.9.5

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

[PATCH v4 3/3] ARM: mm: use static_vm for managing static mapped areas

2013-01-30 Thread Joonsoo Kim

A static mapped area is ARM-specific, so it is better not to use
generic vmalloc data structure, that is, vmlist and vmlist_lock
for managing static mapped area. And it causes some needless overhead and
reducing this overhead is better idea.

Now, we have newly introduced static_vm infrastructure.
With it, we don't need to iterate all mapped areas. Instead, we just
iterate static mapped areas. It helps to reduce an overhead of finding
matched area. And architecture dependency on vmalloc layer is removed,
so it will help to maintainability for vmalloc layer.

Signed-off-by: Joonsoo Kim 

diff --git a/arch/arm/mm/ioremap.c b/arch/arm/mm/ioremap.c
index ceb34ae..7fe5b48 100644
--- a/arch/arm/mm/ioremap.c
+++ b/arch/arm/mm/ioremap.c
@@ -269,13 +269,14 @@ void __iomem * __arm_ioremap_pfn_caller(unsigned long pfn,
const struct mem_type *type;
int err;
unsigned long addr;
-   struct vm_struct * area;
+   struct vm_struct *area;
+   phys_addr_t paddr = __pfn_to_phys(pfn);
 
 #ifndef CONFIG_ARM_LPAE
/*
 * High mappings must be supersection aligned
 */
-   if (pfn >= 0x10 && (__pfn_to_phys(pfn) & ~SUPERSECTION_MASK))
+   if (pfn >= 0x10 && (paddr & ~SUPERSECTION_MASK))
return NULL;
 #endif
 
@@ -291,24 +292,17 @@ void __iomem * __arm_ioremap_pfn_caller(unsigned long pfn,
/*
 * Try to reuse one of the static mapping whenever possible.
 */
-   read_lock(&vmlist_lock);
-   for (area = vmlist; area; area = area->next) {
-   if (!size || (sizeof(phys_addr_t) == 4 && pfn >= 0x10))
-   break;
-   if (!(area->flags & VM_ARM_STATIC_MAPPING))
-   continue;
-   if ((area->flags & VM_ARM_MTYPE_MASK) != VM_ARM_MTYPE(mtype))
-   continue;
-   if (__phys_to_pfn(area->phys_addr) > pfn ||
-   __pfn_to_phys(pfn) + size-1 > area->phys_addr + 
area->size-1)
-   continue;
-   /* we can drop the lock here as we know *area is static */
-   read_unlock(&vmlist_lock);
-   addr = (unsigned long)area->addr;
-   addr += __pfn_to_phys(pfn) - area->phys_addr;
-   return (void __iomem *) (offset + addr);
+   if (size && !((sizeof(phys_addr_t) == 4 && pfn >= 0x10))) {
+   struct static_vm *svm;
+
+   svm = find_static_vm_paddr(paddr, size,
+   STATIC_VM_TYPE(STATIC_VM_MEM, mtype));
+   if (svm) {
+   addr = (unsigned long)svm->vm.addr;
+   addr += paddr - svm->vm.phys_addr;
+   return (void __iomem *) (offset + addr);
+   }
}
-   read_unlock(&vmlist_lock);
 
/*
 * Don't allow RAM to be mapped - this causes problems with ARMv6+
@@ -320,21 +314,21 @@ void __iomem * __arm_ioremap_pfn_caller(unsigned long pfn,
if (!area)
return NULL;
addr = (unsigned long)area->addr;
-   area->phys_addr = __pfn_to_phys(pfn);
+   area->phys_addr = paddr;
 
 #if !defined(CONFIG_SMP) && !defined(CONFIG_ARM_LPAE)
if (DOMAIN_IO == 0 &&
(((cpu_architecture() >= CPU_ARCH_ARMv6) && (get_cr() & CR_XP)) ||
   cpu_is_xsc3()) && pfn >= 0x10 &&
-  !((__pfn_to_phys(pfn) | size | addr) & ~SUPERSECTION_MASK)) {
+  !((paddr | size | addr) & ~SUPERSECTION_MASK)) {
area->flags |= VM_ARM_SECTION_MAPPING;
err = remap_area_supersections(addr, pfn, size, type);
-   } else if (!((__pfn_to_phys(pfn) | size | addr) & ~PMD_MASK)) {
+   } else if (!((paddr | size | addr) & ~PMD_MASK)) {
area->flags |= VM_ARM_SECTION_MAPPING;
err = remap_area_sections(addr, pfn, size, type);
} else
 #endif
-   err = ioremap_page_range(addr, addr + size, __pfn_to_phys(pfn),
+   err = ioremap_page_range(addr, addr + size, paddr,
 __pgprot(type->prot_pte));
 
if (err) {
@@ -418,34 +412,21 @@ __arm_ioremap_exec(unsigned long phys_addr, size_t size, 
bool cached)
 void __iounmap(volatile void __iomem *io_addr)
 {
void *addr = (void *)(PAGE_MASK & (unsigned long)io_addr);
-   struct vm_struct *vm;
+   struct static_vm *svm;
+
+   svm = find_static_vm_vaddr(addr);
+   if (svm)
+   return;
 
-   read_lock(&vmlist_lock);
-   for (vm = vmlist; vm; vm = vm->next) {
-   if (vm->addr > addr)
-   break;
-   if (!(vm->flags & VM_IOREMAP))
-

Re: [PATCH v4 3/3] ARM: mm: use static_vm for managing static mapped areas

2013-02-01 Thread JoonSoo Kim

2013/2/1 Nicolas Pitre :
> On Thu, 31 Jan 2013, Joonsoo Kim wrote:
>
>> A static mapped area is ARM-specific, so it is better not to use
>> generic vmalloc data structure, that is, vmlist and vmlist_lock
>> for managing static mapped area. And it causes some needless overhead and
>> reducing this overhead is better idea.
>>
>> Now, we have newly introduced static_vm infrastructure.
>> With it, we don't need to iterate all mapped areas. Instead, we just
>> iterate static mapped areas. It helps to reduce an overhead of finding
>> matched area. And architecture dependency on vmalloc layer is removed,
>> so it will help to maintainability for vmalloc layer.
>>
>> Signed-off-by: Joonsoo Kim 
>
> Comments below.
>
>>
>> diff --git a/arch/arm/mm/ioremap.c b/arch/arm/mm/ioremap.c
>> index ceb34ae..7fe5b48 100644
>> --- a/arch/arm/mm/ioremap.c
>> +++ b/arch/arm/mm/ioremap.c
>> @@ -269,13 +269,14 @@ void __iomem * __arm_ioremap_pfn_caller(unsigned long 
>> pfn,
>>   const struct mem_type *type;
>>   int err;
>>   unsigned long addr;
>> - struct vm_struct * area;
>> + struct vm_struct *area;
>> + phys_addr_t paddr = __pfn_to_phys(pfn);
>>
>>  #ifndef CONFIG_ARM_LPAE
>>   /*
>>* High mappings must be supersection aligned
>>*/
>> - if (pfn >= 0x10 && (__pfn_to_phys(pfn) & ~SUPERSECTION_MASK))
>> + if (pfn >= 0x10 && (paddr & ~SUPERSECTION_MASK))
>>   return NULL;
>>  #endif
>>
>> @@ -291,24 +292,17 @@ void __iomem * __arm_ioremap_pfn_caller(unsigned long 
>> pfn,
>>   /*
>>* Try to reuse one of the static mapping whenever possible.
>>*/
>> - read_lock(&vmlist_lock);
>> - for (area = vmlist; area; area = area->next) {
>> - if (!size || (sizeof(phys_addr_t) == 4 && pfn >= 0x10))
>> - break;
>> - if (!(area->flags & VM_ARM_STATIC_MAPPING))
>> - continue;
>> - if ((area->flags & VM_ARM_MTYPE_MASK) != VM_ARM_MTYPE(mtype))
>> - continue;
>> - if (__phys_to_pfn(area->phys_addr) > pfn ||
>> - __pfn_to_phys(pfn) + size-1 > area->phys_addr + 
>> area->size-1)
>> - continue;
>> - /* we can drop the lock here as we know *area is static */
>> - read_unlock(&vmlist_lock);
>> - addr = (unsigned long)area->addr;
>> - addr += __pfn_to_phys(pfn) - area->phys_addr;
>> - return (void __iomem *) (offset + addr);
>> + if (size && !((sizeof(phys_addr_t) == 4 && pfn >= 0x10))) {
>> + struct static_vm *svm;
>> +
>> + svm = find_static_vm_paddr(paddr, size,
>> + STATIC_VM_TYPE(STATIC_VM_MEM, mtype));
>> + if (svm) {
>> + addr = (unsigned long)svm->vm.addr;
>> + addr += paddr - svm->vm.phys_addr;
>> + return (void __iomem *) (offset + addr);
>> + }
>>   }
>> - read_unlock(&vmlist_lock);
>>
>>   /*
>>* Don't allow RAM to be mapped - this causes problems with ARMv6+
>> @@ -320,21 +314,21 @@ void __iomem * __arm_ioremap_pfn_caller(unsigned long 
>> pfn,
>>   if (!area)
>>   return NULL;
>>   addr = (unsigned long)area->addr;
>> - area->phys_addr = __pfn_to_phys(pfn);
>> + area->phys_addr = paddr;
>>
>>  #if !defined(CONFIG_SMP) && !defined(CONFIG_ARM_LPAE)
>>   if (DOMAIN_IO == 0 &&
>>   (((cpu_architecture() >= CPU_ARCH_ARMv6) && (get_cr() & CR_XP)) ||
>>  cpu_is_xsc3()) && pfn >= 0x10 &&
>> -!((__pfn_to_phys(pfn) | size | addr) & ~SUPERSECTION_MASK)) {
>> +!((paddr | size | addr) & ~SUPERSECTION_MASK)) {
>>   area->flags |= VM_ARM_SECTION_MAPPING;
>>   err = remap_area_supersections(addr, pfn, size, type);
>> - } else if (!((__pfn_to_phys(pfn) | size | addr) & ~PMD_MASK)) {
>> + } else if (!((paddr | size | addr) & ~PMD_MASK)) {
>>   area->flags |= VM_ARM_SECTION_MAPPING;
>>   err = remap_area_sections(addr, pfn, size, type);
>>   } else
>>  #endif
>> -

Re: [PATCH v4 2/3] ARM: ioremap: introduce an infrastructure for static mapped area

2013-02-01 Thread JoonSoo Kim

Hello, Nicolas.

2013/2/1 Nicolas Pitre :
> On Thu, 31 Jan 2013, Joonsoo Kim wrote:
>
>> In current implementation, we used ARM-specific flag, that is,
>> VM_ARM_STATIC_MAPPING, for distinguishing ARM specific static mapped area.
>> The purpose of static mapped area is to re-use static mapped area when
>> entire physical address range of the ioremap request can be covered
>> by this area.
>>
>> This implementation causes needless overhead for some cases.
>> For example, assume that there is only one static mapped area and
>> vmlist has 300 areas. Every time we call ioremap, we check 300 areas for
>> deciding whether it is matched or not. Moreover, even if there is
>> no static mapped area and vmlist has 300 areas, every time we call
>> ioremap, we check 300 areas in now.
>>
>> If we construct a extra list for static mapped area, we can eliminate
>> above mentioned overhead.
>> With a extra list, if there is one static mapped area,
>> we just check only one area and proceed next operation quickly.
>>
>> In fact, it is not a critical problem, because ioremap is not frequently
>> used. But reducing overhead is better idea.
>>
>> Another reason for doing this work is for removing architecture dependency
>> on vmalloc layer. I think that vmlist and vmlist_lock is internal data
>> structure for vmalloc layer. Some codes for debugging and stat inevitably
>> use vmlist and vmlist_lock. But it is preferable that they are used
>> as least as possible in outside of vmalloc.c
>>
>> Now, I introduce an ARM-specific infrastructure for static mapped area. In
>> the following patch, we will use this and resolve above mentioned problem.
>>
>> Signed-off-by: Joonsoo Kim 
>
> Much better.  Comments below.

Thanks.

>> diff --git a/arch/arm/mm/ioremap.c b/arch/arm/mm/ioremap.c
>> index 88fd86c..ceb34ae 100644
>> --- a/arch/arm/mm/ioremap.c
>> +++ b/arch/arm/mm/ioremap.c
>> @@ -39,6 +39,78 @@
>>  #include 
>>  #include "mm.h"
>>
>> +
>> +LIST_HEAD(static_vmlist);
>> +static DEFINE_RWLOCK(static_vmlist_lock);
>
> In fact you don't need a lock at all.  The only writer is
> add_static_vm_early() and we know it is only used during boot when the
> kernel is still single-threaded.

Yes!

>> +
>> +static struct static_vm *find_static_vm_paddr(phys_addr_t paddr,
>> + size_t size, unsigned long flags)
>> +{
>> + struct static_vm *svm;
>> + struct vm_struct *vm;
>> +
>> + read_lock(&static_vmlist_lock);
>> + list_for_each_entry(svm, &static_vmlist, list) {
>> + if (svm->flags != flags)
>> + continue;
>> +
>> + vm = &svm->vm;
>> + if (vm->phys_addr > paddr ||
>> + paddr + size - 1 > vm->phys_addr + vm->size - 1)
>> + continue;
>> +
>> + read_unlock(&static_vmlist_lock);
>> + return svm;
>> + }
>> +
>> + return NULL;
>> +}
>> +
>> +struct static_vm *find_static_vm_vaddr(void *vaddr)
>> +{
>> + struct static_vm *svm;
>> + struct vm_struct *vm;
>> +
>> + read_lock(&static_vmlist_lock);
>> + list_for_each_entry(svm, &static_vmlist, list) {
>> + vm = &svm->vm;
>> +
>> + /* static_vmlist is ascending order */
>> + if (vm->addr > vaddr)
>> + break;
>> +
>> + if (vm->addr <= vaddr && vm->addr + vm->size > vaddr) {
>> + read_unlock(&static_vmlist_lock);
>> + return svm;
>> + }
>> + }
>> + read_unlock(&static_vmlist_lock);
>> +
>> + return NULL;
>> +}
>> +
>> +void add_static_vm_early(struct static_vm *svm, unsigned long flags)
>
> This should be marked with __init.  This way, it is less likely to be
> used after boot, especially with no locking.  And vm_area_add_early() is
> valid only if !vmap_initialized anyway, and also __init.

Okay.

>> +{
>> + struct static_vm *curr_svm;
>> + struct vm_struct *vm;
>> + void *vaddr;
>> +
>> + vm_area_add_early(&svm->vm);
>> +
>> + vaddr = svm->vm.addr;
>> + svm->flags = flags;
>> +
>> + write_lock(&static_vmlist_lock);
>> + list_for_each_entry(curr_svm, &static_vmlist, list) {
>> + vm =

Re: [PATCH v2 15/17] workqueue: remove global_cwq

2013-01-24 Thread Joonsoo Kim

Hello, Tejun.

On Wed, Jan 23, 2013 at 10:09:57AM -0800, Tejun Heo wrote:
> global_cwq is now nothing but a container for per-cpu standard
> worker_pools.  Declare the worker pools directly as
> cpu/unbound_std_worker_pools[] and remove global_cwq.
> 
> * cacheline_aligned_in_smp moved from global_cwq to worker_pool.
>   This probably would have made sense even before this change as we
>   want each pool to be aligned.
> 
> * get_gcwq() is replaced with std_worker_pools() which returns the
>   pointer to the standard pool array for a given CPU.
> 
> * __alloc_workqueue_key() updated to use get_std_worker_pool() instead
>   of open-coding pool determination.
> 
> This is part of an effort to remove global_cwq and make worker_pool
> the top level abstraction, which in turn will help implementing worker
> pools with user-specified attributes.
> 
> v2: Joonsoo pointed out that it'd better to align struct worker_pool
> rather than the array so that every pool is aligned.
> 
> Signed-off-by: Tejun Heo 
> Cc: Joonsoo Kim 
> ---
> Rebased on top of the current wq/for-3.9 and Joonsoo's comments
> applied.
> 
> Thanks.
> 
>  kernel/workqueue.c  |   46 
> 
>  kernel/workqueue_internal.h |1 
>  2 files changed, 17 insertions(+), 30 deletions(-)
> 
> --- a/kernel/workqueue.c
> +++ b/kernel/workqueue.c
> @@ -144,16 +144,6 @@ struct worker_pool {
>  
>   struct mutexassoc_mutex;/* protect POOL_DISASSOCIATED */
>   struct ida  worker_ida; /* L: for worker IDs */
> -};
> -
> -/*
> - * Global per-cpu workqueue.  There's one and only one for each cpu
> - * and all works are queued and processed here regardless of their
> - * target workqueues.
> - */
> -struct global_cwq {
> - struct worker_pool  pools[NR_STD_WORKER_POOLS];
> - /* normal and highpri pools */
>  } cacheline_aligned_in_smp;
>  
>  /*
> @@ -250,8 +240,8 @@ EXPORT_SYMBOL_GPL(system_freezable_wq);
>  #include 
>  
>  #define for_each_std_worker_pool(pool, cpu)  \
> - for ((pool) = &get_gcwq((cpu))->pools[0];   \
> -  (pool) < &get_gcwq((cpu))->pools[NR_STD_WORKER_POOLS]; (pool)++)
> + for ((pool) = &std_worker_pools(cpu)[0];\
> +  (pool) < &std_worker_pools(cpu)[NR_STD_WORKER_POOLS]; (pool)++)
>  
>  #define for_each_busy_worker(worker, i, pos, pool)   \
>   hash_for_each(pool->busy_hash, i, pos, worker, hentry)
> @@ -427,19 +417,19 @@ static LIST_HEAD(workqueues);
>  static bool workqueue_freezing;  /* W: have wqs started 
> freezing? */
>  
>  /*
> - * The almighty global cpu workqueues.  nr_running is the only field
> - * which is expected to be used frequently by other cpus via
> - * try_to_wake_up().  Put it in a separate cacheline.
> + * The CPU standard worker pools.  nr_running is the only field which is
> + * expected to be used frequently by other cpus via try_to_wake_up().  Put
> + * it in a separate cacheline.
>   */
> -static DEFINE_PER_CPU(struct global_cwq, global_cwq);
> +static DEFINE_PER_CPU(struct worker_pool [NR_STD_WORKER_POOLS],
> +   cpu_std_worker_pools);
>  static DEFINE_PER_CPU_SHARED_ALIGNED(atomic_t, 
> pool_nr_running[NR_STD_WORKER_POOLS]);

AFAIK, worker_pool can be accessed by other cpus.
So, I think that we also need "DEFINE_PER_CPU_SHARED_ALIGNED(struct worker_pool 
[], xxx)".

>  /*
> - * Global cpu workqueue and nr_running counter for unbound gcwq.  The pools
> - * for online CPUs have POOL_DISASSOCIATED set, and all their workers have
> - * WORKER_UNBOUND set.
> + * Standard worker pools and nr_running counter for unbound CPU.  The pools
> + * have POOL_DISASSOCIATED set, and all workers have WORKER_UNBOUND set.
>   */
> -static struct global_cwq unbound_global_cwq;
> +static struct worker_pool unbound_std_worker_pools[NR_STD_WORKER_POOLS];
>  static atomic_t unbound_pool_nr_running[NR_STD_WORKER_POOLS] = {
>   [0 ... NR_STD_WORKER_POOLS - 1] = ATOMIC_INIT(0),   /* always 0 */
>  };
> @@ -450,17 +440,17 @@ static DEFINE_IDR(worker_pool_idr);
>  
>  static int worker_thread(void *__worker);
>  
> -static struct global_cwq *get_gcwq(unsigned int cpu)
> +static struct worker_pool *std_worker_pools(int cpu)
>  {
>   if (cpu != WORK_CPU_UNBOUND)
> - return &per_cpu(global_cwq, cpu);
> + return per_cpu(cpu_std_worker_pools, cpu);
>   else
> - return &unbound_global_cwq;
> + return unbound_st

Re: [PATCH] x86, reboot: skip reboot_fixups in early boot phase

2013-01-24 Thread Joonsoo Kim

Hello, Bjorn.

On Thu, Jan 24, 2013 at 10:45:13AM -0700, Bjorn Helgaas wrote:
> On Fri, Dec 28, 2012 at 6:50 AM, Joonsoo Kim  wrote:
> > During early boot phase, PCI bus subsystem is not yet initialized.
> > If panic is occured in early boot phase and panic_timeout is set,
> > code flow go into emergency_restart() and hit mach_reboot_fixups(), then
> > encounter another panic. When second panic, we can't hold a panic_lock, so
> > code flow go into panic_smp_self_stop() which prevent system to restart.
> >
> > For avoid second panic, skip reboot_fixups in early boot phase.
> > It makes panic_timeout works in early boot phase.
> >
> > Cc: Thomas Gleixner 
> > Cc: Ingo Molnar 
> > Cc: "H. Peter Anvin" 
> > Signed-off-by: Joonsoo Kim 
> >
> > diff --git a/arch/x86/kernel/reboot_fixups_32.c 
> > b/arch/x86/kernel/reboot_fixups_32.c
> > index c8e41e9..b9b8ec9 100644
> > --- a/arch/x86/kernel/reboot_fixups_32.c
> > +++ b/arch/x86/kernel/reboot_fixups_32.c
> > @@ -89,6 +89,10 @@ void mach_reboot_fixups(void)
> > if (in_interrupt())
> > return;
> >
> > +   /* During early boot phase, PCI is not yet initialized */
> > +   if (system_state == SYSTEM_BOOTING)
> > +   return;
> > +
> > for (i=0; i < ARRAY_SIZE(fixups_table); i++) {
> > cur = &(fixups_table[i]);
> > dev = pci_get_device(cur->vendor, cur->device, NULL);
> 
> I guess you're saying that if we call pci_get_device() too early, it
> panics?  Did you figure out why that happens?
> 
> If we call pci_get_device() before PCI has been initialized, it would
> be good if it just returned NULL, indicating that we didn't find any
> matching device.  I looked briefly, and I thought that's what would
> happen, but apparently I'm missing something.

In bus_find_device(), klist_iter_init_node() is called with 
@bus->p->klist_device. Before initialization, bus->p is NULL,
so panic is occured.

> Bjorn
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCHv6 4/8] zswap: add to mm/

2013-02-24 Thread Joonsoo Kim

Hello, Seth.
Here comes minor comments.

On Wed, Feb 20, 2013 at 04:04:44PM -0600, Seth Jennings wrote:
> zswap is a thin compression backend for frontswap. It receives
> pages from frontswap and attempts to store them in a compressed
> memory pool, resulting in an effective partial memory reclaim and
> dramatically reduced swap device I/O.
> 
> Additionally, in most cases, pages can be retrieved from this
> compressed store much more quickly than reading from tradition
> swap devices resulting in faster performance for many workloads.
> 
> This patch adds the zswap driver to mm/
> 
> Signed-off-by: Seth Jennings 
> ---
>  mm/Kconfig  |  15 ++
>  mm/Makefile |   1 +
>  mm/zswap.c  | 665 
> 
>  3 files changed, 681 insertions(+)
>  create mode 100644 mm/zswap.c
> 
> diff --git a/mm/Kconfig b/mm/Kconfig
> index 25b8f38..f9f35b7 100644
> --- a/mm/Kconfig
> +++ b/mm/Kconfig
> @@ -470,3 +470,18 @@ config PGTABLE_MAPPING
>  
> You can check speed with zsmalloc benchmark[1].
> [1] https://github.com/spartacus06/zsmalloc
> +
> +config ZSWAP
> + bool "In-kernel swap page compression"
> + depends on FRONTSWAP && CRYPTO
> + select CRYPTO_LZO
> + select ZSMALLOC
> + default n
> + help
> +   Zswap is a backend for the frontswap mechanism in the VMM.
> +   It receives pages from frontswap and attempts to store them
> +   in a compressed memory pool, resulting in an effective
> +   partial memory reclaim.  In addition, pages and be retrieved
> +   from this compressed store much faster than most tradition
> +   swap devices resulting in reduced I/O and faster performance
> +   for many workloads.
> diff --git a/mm/Makefile b/mm/Makefile
> index 0f6ef0a..1e0198f 100644
> --- a/mm/Makefile
> +++ b/mm/Makefile
> @@ -32,6 +32,7 @@ obj-$(CONFIG_HAVE_MEMBLOCK) += memblock.o
>  obj-$(CONFIG_BOUNCE) += bounce.o
>  obj-$(CONFIG_SWAP)   += page_io.o swap_state.o swapfile.o
>  obj-$(CONFIG_FRONTSWAP)  += frontswap.o
> +obj-$(CONFIG_ZSWAP)  += zswap.o
>  obj-$(CONFIG_HAS_DMA)+= dmapool.o
>  obj-$(CONFIG_HUGETLBFS)  += hugetlb.o
>  obj-$(CONFIG_NUMA)   += mempolicy.o
> diff --git a/mm/zswap.c b/mm/zswap.c
> new file mode 100644
> index 000..d3b4943
> --- /dev/null
> +++ b/mm/zswap.c
> @@ -0,0 +1,665 @@
> +/*
> + * zswap.c - zswap driver file
> + *
> + * zswap is a backend for frontswap that takes pages that are in the
> + * process of being swapped out and attempts to compress them and store
> + * them in a RAM-based memory pool.  This results in a significant I/O
> + * reduction on the real swap device and, in the case of a slow swap
> + * device, can also improve workload performance.
> + *
> + * Copyright (C) 2012  Seth Jennings 
> + *
> + * This program is free software; you can redistribute it and/or
> + * modify it under the terms of the GNU General Public License
> + * as published by the Free Software Foundation; either version 2
> + * of the License, or (at your option) any later version.
> + *
> + * This program is distributed in the hope that it will be useful,
> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> + * GNU General Public License for more details.
> +*/
> +
> +#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
> +
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +
> +/*
> +* statistics
> +**/
> +/* Number of memory pages used by the compressed pool */
> +static atomic_t zswap_pool_pages = ATOMIC_INIT(0);
> +/* The number of compressed pages currently stored in zswap */
> +static atomic_t zswap_stored_pages = ATOMIC_INIT(0);
> +
> +/*
> + * The statistics below are not protected from concurrent access for
> + * performance reasons so they may not be a 100% accurate.  However,
> + * they do provide useful information on roughly how many times a
> + * certain event is occurring.
> +*/
> +static u64 zswap_pool_limit_hit;
> +static u64 zswap_reject_compress_poor;
> +static u64 zswap_reject_zsmalloc_fail;
> +static u64 zswap_reject_kmemcache_fail;
> +static u64 zswap_duplicate_entry;
> +
> +/*
> +* tunables
> +**/
> +/* Enable/disable zswap (disabled by default, fixed at boot for now) */
> +static bool zswap_enabled;
> +module_param_named(enabled, zswap_enabled, bool, 0);
> +
> +/* Compressor to be used by zswap (fixed at boot for now) */
> +#define ZSWAP_COMPRESSOR_DEFAULT "lzo"
> +static char *zswap_compressor = ZSWAP_COMPRESSOR_DEFAULT;
> +module_param_named(compressor, zswap_compressor, charp, 0);
> +
> +/* The maximum percentage of memory that the compressed pool can occupy */
> +static unsigned int zswap_max_pool_percent = 20;
> +module_para

Re: [PATCH 0/8] correct load_balance()

2013-02-24 Thread Joonsoo Kim

On Thu, Feb 14, 2013 at 02:48:33PM +0900, Joonsoo Kim wrote:
> Commit 88b8dac0 makes load_balance() consider other cpus in its group.
> But, there are some missing parts for this feature to work properly.
> This patchset correct these things and make load_balance() robust.
> 
> Others are related to LBF_ALL_PINNED. This is fallback functionality
> when all tasks can't be moved as cpu affinity. But, currently,
> if imbalance is not large enough to task's load, we leave LBF_ALL_PINNED
> flag and 'redo' is triggered. This is not our intention, so correct it.
> 
> These are based on v3.8-rc7.
> 
> Joonsoo Kim (8):
>   sched: change position of resched_cpu() in load_balance()
>   sched: explicitly cpu_idle_type checking in rebalance_domains()
>   sched: don't consider other cpus in our group in case of NEWLY_IDLE
>   sched: clean up move_task() and move_one_task()
>   sched: move up affinity check to mitigate useless redoing overhead
>   sched: rename load_balance_tmpmask to load_balance_cpu_active
>   sched: prevent to re-select dst-cpu in load_balance()
>   sched: reset lb_env when redo in load_balance()
> 
>  kernel/sched/core.c |9 +++--
>  kernel/sched/fair.c |  107 
> +--
>  2 files changed, 67 insertions(+), 49 deletions(-)

Hello, Ingo and Peter.

Could you review this patch set?
Please let me know what I should do for merging this?

Thanks.

> -- 
> 1.7.9.5
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH v2] slub: correctly bootstrap boot caches

2013-02-24 Thread Joonsoo Kim

Hello, Christoph.

On Sun, Feb 24, 2013 at 12:35:22AM +, Christoph Lameter wrote:
> On Sat, 23 Feb 2013, JoonSoo Kim wrote:
> 
> > With flushing, deactivate_slab() occur and it has some overhead to
> > deactivate objects.
> > If my patch properly fix this situation, it is better to use mine
> > which has no overhead.
> 
> Well this occurs during boot and its not that performance critical.

Hmm...
Yes, this is not performance critical place, but why do we use
a sub-optimal solution?

And flushing is abstration for complicated logic, so I think that
my raw implemntation is better for understanding.

But, I have no objection to merge Glauber's one
if you think that is better.

Thanks.

> 
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majord...@kvack.org.  For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: mailto:"d...@kvack.org";> em...@kvack.org 
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCHv5 1/8] zsmalloc: add to mm/

2013-02-19 Thread Joonsoo Kim

Hello, Seth.
I'm not sure that this is right time to review, because I already have
seen many effort of various people to promote zxxx series. I don't want to
be a stopper to promote these. :)

But, I read the code, now, and then some comments below.

On Wed, Feb 13, 2013 at 12:38:44PM -0600, Seth Jennings wrote:
> =
> DO NOT MERGE, FOR REVIEW ONLY
> This patch introduces zsmalloc as new code, however, it already
> exists in drivers/staging.  In order to build successfully, you
> must select EITHER to driver/staging version OR this version.
> Once zsmalloc is reviewed in this format (and hopefully accepted),
> I will create a new patchset that properly promotes zsmalloc from
> staging.
> =
> 
> This patchset introduces a new slab-based memory allocator,
> zsmalloc, for storing compressed pages.  It is designed for
> low fragmentation and high allocation success rate on
> large object, but <= PAGE_SIZE allocations.
> 
> zsmalloc differs from the kernel slab allocator in two primary
> ways to achieve these design goals.
> 
> zsmalloc never requires high order page allocations to back
> slabs, or "size classes" in zsmalloc terms. Instead it allows
> multiple single-order pages to be stitched together into a
> "zspage" which backs the slab.  This allows for higher allocation
> success rate under memory pressure.
> 
> Also, zsmalloc allows objects to span page boundaries within the
> zspage.  This allows for lower fragmentation than could be had
> with the kernel slab allocator for objects between PAGE_SIZE/2
> and PAGE_SIZE.  With the kernel slab allocator, if a page compresses
> to 60% of it original size, the memory savings gained through
> compression is lost in fragmentation because another object of
> the same size can't be stored in the leftover space.
> 
> This ability to span pages results in zsmalloc allocations not being
> directly addressable by the user.  The user is given an
> non-dereferencable handle in response to an allocation request.
> That handle must be mapped, using zs_map_object(), which returns
> a pointer to the mapped region that can be used.  The mapping is
> necessary since the object data may reside in two different
> noncontigious pages.
> 
> zsmalloc fulfills the allocation needs for zram and zswap.
> 
> Acked-by: Nitin Gupta 
> Acked-by: Minchan Kim 
> Signed-off-by: Seth Jennings 
> ---
>  include/linux/zsmalloc.h |   49 ++
>  mm/Kconfig   |   24 +
>  mm/Makefile  |1 +
>  mm/zsmalloc.c| 1124 
> ++
>  4 files changed, 1198 insertions(+)
>  create mode 100644 include/linux/zsmalloc.h
>  create mode 100644 mm/zsmalloc.c
> 
> diff --git a/include/linux/zsmalloc.h b/include/linux/zsmalloc.h
> new file mode 100644
> index 000..eb6efb6
> --- /dev/null
> +++ b/include/linux/zsmalloc.h
> @@ -0,0 +1,49 @@
> +/*
> + * zsmalloc memory allocator
> + *
> + * Copyright (C) 2011  Nitin Gupta
> + *
> + * This code is released using a dual license strategy: BSD/GPL
> + * You can choose the license that better fits your requirements.
> + *
> + * Released under the terms of 3-clause BSD License
> + * Released under the terms of GNU General Public License Version 2.0
> + */
> +
> +#ifndef _ZS_MALLOC_H_
> +#define _ZS_MALLOC_H_
> +
> +#include 
> +#include 
> +
> +/*
> + * zsmalloc mapping modes
> + *
> + * NOTE: These only make a difference when a mapped object spans pages
> +*/
> +enum zs_mapmode {
> + ZS_MM_RW, /* normal read-write mapping */
> + ZS_MM_RO, /* read-only (no copy-out at unmap time) */
> + ZS_MM_WO /* write-only (no copy-in at map time) */
> +};


These makes no difference for PGTABLE_MAPPING.
Please add some comment for this.

> +struct zs_ops {
> + struct page * (*alloc)(gfp_t);
> + void (*free)(struct page *);
> +};
> +
> +struct zs_pool;
> +
> +struct zs_pool *zs_create_pool(gfp_t flags, struct zs_ops *ops);
> +void zs_destroy_pool(struct zs_pool *pool);
> +
> +unsigned long zs_malloc(struct zs_pool *pool, size_t size, gfp_t flags);
> +void zs_free(struct zs_pool *pool, unsigned long obj);
> +
> +void *zs_map_object(struct zs_pool *pool, unsigned long handle,
> + enum zs_mapmode mm);
> +void zs_unmap_object(struct zs_pool *pool, unsigned long handle);
> +
> +u64 zs_get_total_size_bytes(struct zs_pool *pool);
> +
> +#endif
> diff --git a/mm/Kconfig b/mm/Kconfig
> index 278e3ab..25b8f38 100644
> --- a/mm/Kconfig
> +++ b/mm/Kconfig
> @@ -446,3 +446,27 @@ config FRONTSWAP
> and swap data is stored as normal on the matching swap device.
>  
> If unsure, say Y to enable frontswap.
> +
> +config ZSMALLOC
> + tristate "Memory allocator for compressed pages"
> + default n
> + help
> +   zsmalloc is a slab-based memory allocator designed to store
> +   compressed RAM pages.  zsmalloc uses virtual memory mapping
> +   in order to reduce fragmentation.  However, this results in a
> +   non-standard allocat

Re: [PATCHv5 1/8] zsmalloc: add to mm/

2013-02-22 Thread Joonsoo Kim

On Wed, Feb 20, 2013 at 08:37:33AM +0900, Minchan Kim wrote:
> On Tue, Feb 19, 2013 at 11:54:21AM -0600, Seth Jennings wrote:
> > On 02/19/2013 03:18 AM, Joonsoo Kim wrote:
> > > Hello, Seth.
> > > I'm not sure that this is right time to review, because I already have
> > > seen many effort of various people to promote zxxx series. I don't want to
> > > be a stopper to promote these. :)
> > 
> > Any time is good review time :)  Thanks for your review!
> > 
> > > 
> > > But, I read the code, now, and then some comments below.
> > > 
> > > On Wed, Feb 13, 2013 at 12:38:44PM -0600, Seth Jennings wrote:
> > >> =
> > >> DO NOT MERGE, FOR REVIEW ONLY
> > >> This patch introduces zsmalloc as new code, however, it already
> > >> exists in drivers/staging.  In order to build successfully, you
> > >> must select EITHER to driver/staging version OR this version.
> > >> Once zsmalloc is reviewed in this format (and hopefully accepted),
> > >> I will create a new patchset that properly promotes zsmalloc from
> > >> staging.
> > >> =
> > >>
> > >> This patchset introduces a new slab-based memory allocator,
> > >> zsmalloc, for storing compressed pages.  It is designed for
> > >> low fragmentation and high allocation success rate on
> > >> large object, but <= PAGE_SIZE allocations.
> > >>
> > >> zsmalloc differs from the kernel slab allocator in two primary
> > >> ways to achieve these design goals.
> > >>
> > >> zsmalloc never requires high order page allocations to back
> > >> slabs, or "size classes" in zsmalloc terms. Instead it allows
> > >> multiple single-order pages to be stitched together into a
> > >> "zspage" which backs the slab.  This allows for higher allocation
> > >> success rate under memory pressure.
> > >>
> > >> Also, zsmalloc allows objects to span page boundaries within the
> > >> zspage.  This allows for lower fragmentation than could be had
> > >> with the kernel slab allocator for objects between PAGE_SIZE/2
> > >> and PAGE_SIZE.  With the kernel slab allocator, if a page compresses
> > >> to 60% of it original size, the memory savings gained through
> > >> compression is lost in fragmentation because another object of
> > >> the same size can't be stored in the leftover space.
> > >>
> > >> This ability to span pages results in zsmalloc allocations not being
> > >> directly addressable by the user.  The user is given an
> > >> non-dereferencable handle in response to an allocation request.
> > >> That handle must be mapped, using zs_map_object(), which returns
> > >> a pointer to the mapped region that can be used.  The mapping is
> > >> necessary since the object data may reside in two different
> > >> noncontigious pages.
> > >>
> > >> zsmalloc fulfills the allocation needs for zram and zswap.
> > >>
> > >> Acked-by: Nitin Gupta 
> > >> Acked-by: Minchan Kim 
> > >> Signed-off-by: Seth Jennings 
> > >> ---
> > >>  include/linux/zsmalloc.h |   49 ++
> > >>  mm/Kconfig   |   24 +
> > >>  mm/Makefile  |1 +
> > >>  mm/zsmalloc.c| 1124 
> > >> ++
> > >>  4 files changed, 1198 insertions(+)
> > >>  create mode 100644 include/linux/zsmalloc.h
> > >>  create mode 100644 mm/zsmalloc.c
> > >>
> > >> diff --git a/include/linux/zsmalloc.h b/include/linux/zsmalloc.h
> > >> new file mode 100644
> > >> index 000..eb6efb6
> > >> --- /dev/null
> > >> +++ b/include/linux/zsmalloc.h
> > >> @@ -0,0 +1,49 @@
> > >> +/*
> > >> + * zsmalloc memory allocator
> > >> + *
> > >> + * Copyright (C) 2011  Nitin Gupta
> > >> + *
> > >> + * This code is released using a dual license strategy: BSD/GPL
> > >> + * You can choose the license that better fits your requirements.
> > >> + *
> > >> + * Released under the terms of 3-clause BSD License
> > >> + * Released under the terms of GNU General Public License Version 2.0
> > >> + */
> > >> +
> > >> +#ifndef _ZS_MALLOC_H_
> > >> +#define _ZS_MALLOC_H_
>

Re: [PATCHv6 1/8] zsmalloc: add to mm/

2013-02-22 Thread Joonsoo Kim

On Wed, Feb 20, 2013 at 04:04:41PM -0600, Seth Jennings wrote:
> =
> DO NOT MERGE, FOR REVIEW ONLY
> This patch introduces zsmalloc as new code, however, it already
> exists in drivers/staging.  In order to build successfully, you
> must select EITHER to driver/staging version OR this version.
> Once zsmalloc is reviewed in this format (and hopefully accepted),
> I will create a new patchset that properly promotes zsmalloc from
> staging.
> =
> 
> This patchset introduces a new slab-based memory allocator,
> zsmalloc, for storing compressed pages.  It is designed for
> low fragmentation and high allocation success rate on
> large object, but <= PAGE_SIZE allocations.
> 
> zsmalloc differs from the kernel slab allocator in two primary
> ways to achieve these design goals.
> 
> zsmalloc never requires high order page allocations to back
> slabs, or "size classes" in zsmalloc terms. Instead it allows
> multiple single-order pages to be stitched together into a
> "zspage" which backs the slab.  This allows for higher allocation
> success rate under memory pressure.

I have one more concern. It may be possibly stale question.
I think that zsmalloc makes system memories more fragmented.
Pages for zsmalloc can't be moved and may be spread at random location
because it's order is just 0, so high order allocation success rate
will be decreased. How do you think about it?

> Also, zsmalloc allows objects to span page boundaries within the
> zspage.  This allows for lower fragmentation than could be had
> with the kernel slab allocator for objects between PAGE_SIZE/2
> and PAGE_SIZE.  With the kernel slab allocator, if a page compresses
> to 60% of it original size, the memory savings gained through
> compression is lost in fragmentation because another object of
> the same size can't be stored in the leftover space.
> 
> This ability to span pages results in zsmalloc allocations not being
> directly addressable by the user.  The user is given an
> non-dereferencable handle in response to an allocation request.
> That handle must be mapped, using zs_map_object(), which returns
> a pointer to the mapped region that can be used.  The mapping is
> necessary since the object data may reside in two different
> noncontigious pages.
> 
> zsmalloc fulfills the allocation needs for zram and zswap.
> 
> Acked-by: Nitin Gupta 
> Acked-by: Minchan Kim 
> Signed-off-by: Seth Jennings 
> ---
>  include/linux/zsmalloc.h |   56 +++
>  mm/Kconfig   |   24 +
>  mm/Makefile  |1 +
>  mm/zsmalloc.c| 1117 
> ++
>  4 files changed, 1198 insertions(+)
>  create mode 100644 include/linux/zsmalloc.h
>  create mode 100644 mm/zsmalloc.c
> 
> diff --git a/include/linux/zsmalloc.h b/include/linux/zsmalloc.h
> new file mode 100644
> index 000..398dae3
> --- /dev/null
> +++ b/include/linux/zsmalloc.h
> @@ -0,0 +1,56 @@
> +/*
> + * zsmalloc memory allocator
> + *
> + * Copyright (C) 2011  Nitin Gupta
> + *
> + * This code is released using a dual license strategy: BSD/GPL
> + * You can choose the license that better fits your requirements.
> + *
> + * Released under the terms of 3-clause BSD License
> + * Released under the terms of GNU General Public License Version 2.0
> + */
> +
> +#ifndef _ZS_MALLOC_H_
> +#define _ZS_MALLOC_H_
> +
> +#include 
> +#include 
> +
> +/*
> + * zsmalloc mapping modes
> + *
> + * NOTE: These only make a difference when a mapped object spans pages.
> + *   They also have no effect when PGTABLE_MAPPING is selected.
> +*/
> +enum zs_mapmode {
> + ZS_MM_RW, /* normal read-write mapping */
> + ZS_MM_RO, /* read-only (no copy-out at unmap time) */
> + ZS_MM_WO /* write-only (no copy-in at map time) */
> + /*
> +  * NOTE: ZS_MM_WO should only be used for initializing new
> +  * (uninitialized) allocations.  Partial writes to already
> +  * initialized allocations should use ZS_MM_RW to preserve the
> +  * existing data.
> +  */
> +};
> +
> +struct zs_ops {
> + struct page * (*alloc)(gfp_t);
> + void (*free)(struct page *);
> +};
> +
> +struct zs_pool;
> +
> +struct zs_pool *zs_create_pool(gfp_t flags, struct zs_ops *ops);
> +void zs_destroy_pool(struct zs_pool *pool);
> +
> +unsigned long zs_malloc(struct zs_pool *pool, size_t size, gfp_t flags);
> +void zs_free(struct zs_pool *pool, unsigned long obj);
> +
> +void *zs_map_object(struct zs_pool *pool, unsigned long handle,
> + enum zs_mapmode mm);
> +void zs_unmap_object(struct zs_pool *pool, unsigned long handle);
> +
> +u64 zs_get_total_size_bytes(struct zs_pool *pool);
> +
> +#endif
> diff --git a/mm/Kconfig b/mm/Kconfig
> index 278e3ab..25b8f38 100644
> --- a/mm/Kconfig
> +++ b/mm/Kconfig
> @@ -446,3 +446,27 @@ config FRONTSWAP
> and swap data is stored as normal on the matching swap device.
>  
> If unsure, say Y to enable frontswap.
> +
> +config ZSMALLOC
> + tristate "Memory alloc

Re: [PATCH v2] slub: correctly bootstrap boot caches

2013-02-22 Thread JoonSoo Kim

2013/2/23 Christoph Lameter :
> On Fri, 22 Feb 2013, Glauber Costa wrote:
>
>> On 02/22/2013 09:01 PM, Christoph Lameter wrote:
>> > Argh. This one was the final version:
>> >
>> > https://patchwork.kernel.org/patch/2009521/
>> >
>>
>> It seems it would work. It is all the same to me.
>> Which one do you prefer?
>
> Flushing seems to be simpler and less code.
>

Hello, Christoph, Glauber.

With flushing, deactivate_slab() occur and it has some overhead to
deactivate objects.
If my patch properly fix this situation, it is better to use mine
which has no overhead.

Thanks.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

[PATCH v2 7/8] mm, vmalloc: export vmap_area_list, instead of vmlist

2013-03-12 Thread Joonsoo Kim

From: Joonsoo Kim 

Although our intention is to unexport internal structure entirely,
but there is one exception for kexec. kexec dumps address of vmlist
and makedumpfile uses this information.

We are about to remove vmlist, then another way to retrieve information
of vmalloc layer is needed for makedumpfile. For this purpose,
we export vmap_area_list, instead of vmlist.

Cc: Eric Biederman 
Cc: Dave Anderson 
Cc: Vivek Goyal 
Cc: Atsushi Kumagai 
Signed-off-by: Joonsoo Kim 
Signed-off-by: Joonsoo Kim 

diff --git a/include/linux/vmalloc.h b/include/linux/vmalloc.h
index 698b1e5..8a25f90 100644
--- a/include/linux/vmalloc.h
+++ b/include/linux/vmalloc.h
@@ -130,8 +130,7 @@ extern long vwrite(char *buf, char *addr, unsigned long 
count);
 /*
  * Internals.  Dont't use..
  */
-extern rwlock_t vmlist_lock;
-extern struct vm_struct *vmlist;
+extern struct list_head vmap_area_list;
 extern __init void vm_area_add_early(struct vm_struct *vm);
 extern __init void vm_area_register_early(struct vm_struct *vm, size_t align);
 
diff --git a/kernel/kexec.c b/kernel/kexec.c
index bddd3d7..d9bfc6c 100644
--- a/kernel/kexec.c
+++ b/kernel/kexec.c
@@ -1489,7 +1489,7 @@ static int __init crash_save_vmcoreinfo_init(void)
VMCOREINFO_SYMBOL(swapper_pg_dir);
 #endif
VMCOREINFO_SYMBOL(_stext);
-   VMCOREINFO_SYMBOL(vmlist);
+   VMCOREINFO_SYMBOL(vmap_area_list);
 
 #ifndef CONFIG_NEED_MULTIPLE_NODES
VMCOREINFO_SYMBOL(mem_map);
diff --git a/mm/nommu.c b/mm/nommu.c
index e193280..ed82358 100644
--- a/mm/nommu.c
+++ b/mm/nommu.c
@@ -228,8 +228,7 @@ int follow_pfn(struct vm_area_struct *vma, unsigned long 
address,
 }
 EXPORT_SYMBOL(follow_pfn);
 
-DEFINE_RWLOCK(vmlist_lock);
-struct vm_struct *vmlist;
+LIST_HEAD(vmap_area_list);
 
 void vfree(const void *addr)
 {
diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index bda6cef..7e63984 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -261,7 +261,8 @@ struct vmap_area {
 };
 
 static DEFINE_SPINLOCK(vmap_area_lock);
-static LIST_HEAD(vmap_area_list);
+/* Export for kexec only */
+LIST_HEAD(vmap_area_list);
 static struct rb_root vmap_area_root = RB_ROOT;
 
 /* The vmap cache globals are protected by vmap_area_lock */
@@ -272,6 +273,10 @@ static unsigned long cached_align;
 
 static unsigned long vmap_area_pcpu_hole;
 
+/*** Old vmalloc interfaces ***/
+static DEFINE_RWLOCK(vmlist_lock);
+static struct vm_struct *vmlist;
+
 static struct vmap_area *__find_vmap_area(unsigned long addr)
 {
struct rb_node *n = vmap_area_root.rb_node;
@@ -1283,10 +1288,6 @@ int map_vm_area(struct vm_struct *area, pgprot_t prot, 
struct page ***pages)
 }
 EXPORT_SYMBOL_GPL(map_vm_area);
 
-/*** Old vmalloc interfaces ***/
-DEFINE_RWLOCK(vmlist_lock);
-struct vm_struct *vmlist;
-
 static void setup_vmalloc_vm(struct vm_struct *vm, struct vmap_area *va,
  unsigned long flags, const void *caller)
 {
-- 
1.7.9.5

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

[PATCH v2 0/8] remove vm_struct list management

2013-03-12 Thread Joonsoo Kim

This patchset remove vm_struct list management after initializing vmalloc.
Adding and removing an entry to vmlist is linear time complexity, so
it is inefficient. If we maintain this list, overall time complexity of
adding and removing area to vmalloc space is O(N), although we use
rbtree for finding vacant place and it's time complexity is just O(logN).

And vmlist and vmlist_lock is used many places of outside of vmalloc.c.
It is preferable that we hide this raw data structure and provide
well-defined function for supporting them, because it makes that they
cannot mistake when manipulating theses structure and it makes us easily
maintain vmalloc layer.

For kexec and makedumpfile, I export vmap_area_list, instead of vmlist.
This comes from Atsushi's recommendation.
For more information, please refer below link.
https://lkml.org/lkml/2012/12/6/184

These are based on v3.9-rc2.

Changes from v1
5/8: skip areas for lazy_free
6/8: skip areas for lazy_free
7/8: export vmap_area_list for kexec, instead of vmlist

Joonsoo Kim (8):
  mm, vmalloc: change iterating a vmlist to find_vm_area()
  mm, vmalloc: move get_vmalloc_info() to vmalloc.c
  mm, vmalloc: protect va->vm by vmap_area_lock
  mm, vmalloc: iterate vmap_area_list, instead of vmlist in
vread/vwrite()
  mm, vmalloc: iterate vmap_area_list in get_vmalloc_info()
  mm, vmalloc: iterate vmap_area_list, instead of vmlist, in
vmallocinfo()
  mm, vmalloc: export vmap_area_list, instead of vmlist
  mm, vmalloc: remove list management of vmlist after initializing
vmalloc

 arch/tile/mm/pgtable.c  |7 +-
 arch/unicore32/mm/ioremap.c |   17 ++--
 arch/x86/mm/ioremap.c   |7 +-
 fs/proc/Makefile|2 +-
 fs/proc/internal.h  |   18 
 fs/proc/meminfo.c   |1 +
 fs/proc/mmu.c   |   60 -
 include/linux/vmalloc.h |   21 -
 kernel/kexec.c  |2 +-
 mm/nommu.c  |3 +-
 mm/vmalloc.c|  207 +--
 11 files changed, 170 insertions(+), 175 deletions(-)
 delete mode 100644 fs/proc/mmu.c

-- 
1.7.9.5

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

[PATCH v2 5/8] mm, vmalloc: iterate vmap_area_list in get_vmalloc_info()

2013-03-12 Thread Joonsoo Kim

From: Joonsoo Kim 

This patch is preparing step for removing vmlist entirely.
For above purpose, we change iterating a vmap_list codes to iterating a
vmap_area_list. It is somewhat trivial change, but just one thing
should be noticed.

vmlist is lack of information about some areas in vmalloc address space.
For example, vm_map_ram() allocate area in vmalloc address space,
but it doesn't make a link with vmlist. To provide full information about
vmalloc address space is better idea, so we don't use va->vm and use
vmap_area directly.
This makes get_vmalloc_info() more precise.

Signed-off-by: Joonsoo Kim 
Signed-off-by: Joonsoo Kim 

diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index 59aa328..aee1f61 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -2671,46 +2671,50 @@ module_init(proc_vmalloc_init);
 
 void get_vmalloc_info(struct vmalloc_info *vmi)
 {
-   struct vm_struct *vma;
+   struct vmap_area *va;
unsigned long free_area_size;
unsigned long prev_end;
 
vmi->used = 0;
+   vmi->largest_chunk = 0;
 
-   if (!vmlist) {
-   vmi->largest_chunk = VMALLOC_TOTAL;
-   } else {
-   vmi->largest_chunk = 0;
+   prev_end = VMALLOC_START;
 
-   prev_end = VMALLOC_START;
-
-   read_lock(&vmlist_lock);
+   spin_lock(&vmap_area_lock);
 
-   for (vma = vmlist; vma; vma = vma->next) {
-   unsigned long addr = (unsigned long) vma->addr;
+   if (list_empty(&vmap_area_list)) {
+   vmi->largest_chunk = VMALLOC_TOTAL;
+   goto out;
+   }
 
-   /*
-* Some archs keep another range for modules in vmlist
-*/
-   if (addr < VMALLOC_START)
-   continue;
-   if (addr >= VMALLOC_END)
-   break;
+   list_for_each_entry(va, &vmap_area_list, list) {
+   unsigned long addr = va->va_start;
 
-   vmi->used += vma->size;
+   /*
+* Some archs keep another range for modules in vmalloc space
+*/
+   if (addr < VMALLOC_START)
+   continue;
+   if (addr >= VMALLOC_END)
+   break;
 
-   free_area_size = addr - prev_end;
-   if (vmi->largest_chunk < free_area_size)
-   vmi->largest_chunk = free_area_size;
+   if (va->flags & (VM_LAZY_FREE | VM_LAZY_FREEING))
+   continue;
 
-   prev_end = vma->size + addr;
-   }
+   vmi->used += (va->va_end - va->va_start);
 
-   if (VMALLOC_END - prev_end > vmi->largest_chunk)
-   vmi->largest_chunk = VMALLOC_END - prev_end;
+   free_area_size = addr - prev_end;
+   if (vmi->largest_chunk < free_area_size)
+   vmi->largest_chunk = free_area_size;
 
-   read_unlock(&vmlist_lock);
+   prev_end = va->va_end;
}
+
+   if (VMALLOC_END - prev_end > vmi->largest_chunk)
+   vmi->largest_chunk = VMALLOC_END - prev_end;
+
+out:
+   spin_unlock(&vmap_area_lock);
 }
 #endif
 
-- 
1.7.9.5

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

[PATCH v2 4/8] mm, vmalloc: iterate vmap_area_list, instead of vmlist in vread/vwrite()

2013-03-12 Thread Joonsoo Kim

From: Joonsoo Kim 

Now, when we hold a vmap_area_lock, va->vm can't be discarded. So we can
safely access to va->vm when iterating a vmap_area_list with holding a
vmap_area_lock. With this property, change iterating vmlist codes in
vread/vwrite() to iterating vmap_area_list.

There is a little difference relate to lock, because vmlist_lock is mutex,
but, vmap_area_lock is spin_lock. It may introduce a spinning overhead
during vread/vwrite() is executing. But, these are debug-oriented
functions, so this overhead is not real problem for common case.

Signed-off-by: Joonsoo Kim 
Signed-off-by: Joonsoo Kim 

diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index 1bf94ad..59aa328 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -2012,7 +2012,8 @@ static int aligned_vwrite(char *buf, char *addr, unsigned 
long count)
 
 long vread(char *buf, char *addr, unsigned long count)
 {
-   struct vm_struct *tmp;
+   struct vmap_area *va;
+   struct vm_struct *vm;
char *vaddr, *buf_start = buf;
unsigned long buflen = count;
unsigned long n;
@@ -2021,10 +2022,17 @@ long vread(char *buf, char *addr, unsigned long count)
if ((unsigned long) addr + count < count)
count = -(unsigned long) addr;
 
-   read_lock(&vmlist_lock);
-   for (tmp = vmlist; count && tmp; tmp = tmp->next) {
-   vaddr = (char *) tmp->addr;
-   if (addr >= vaddr + tmp->size - PAGE_SIZE)
+   spin_lock(&vmap_area_lock);
+   list_for_each_entry(va, &vmap_area_list, list) {
+   if (!count)
+   break;
+
+   if (!(va->flags & VM_VM_AREA))
+   continue;
+
+   vm = va->vm;
+   vaddr = (char *) vm->addr;
+   if (addr >= vaddr + vm->size - PAGE_SIZE)
continue;
while (addr < vaddr) {
if (count == 0)
@@ -2034,10 +2042,10 @@ long vread(char *buf, char *addr, unsigned long count)
addr++;
count--;
}
-   n = vaddr + tmp->size - PAGE_SIZE - addr;
+   n = vaddr + vm->size - PAGE_SIZE - addr;
if (n > count)
n = count;
-   if (!(tmp->flags & VM_IOREMAP))
+   if (!(vm->flags & VM_IOREMAP))
aligned_vread(buf, addr, n);
else /* IOREMAP area is treated as memory hole */
memset(buf, 0, n);
@@ -2046,7 +2054,7 @@ long vread(char *buf, char *addr, unsigned long count)
count -= n;
}
 finished:
-   read_unlock(&vmlist_lock);
+   spin_unlock(&vmap_area_lock);
 
if (buf == buf_start)
return 0;
@@ -2085,7 +2093,8 @@ finished:
 
 long vwrite(char *buf, char *addr, unsigned long count)
 {
-   struct vm_struct *tmp;
+   struct vmap_area *va;
+   struct vm_struct *vm;
char *vaddr;
unsigned long n, buflen;
int copied = 0;
@@ -2095,10 +2104,17 @@ long vwrite(char *buf, char *addr, unsigned long count)
count = -(unsigned long) addr;
buflen = count;
 
-   read_lock(&vmlist_lock);
-   for (tmp = vmlist; count && tmp; tmp = tmp->next) {
-   vaddr = (char *) tmp->addr;
-   if (addr >= vaddr + tmp->size - PAGE_SIZE)
+   spin_lock(&vmap_area_lock);
+   list_for_each_entry(va, &vmap_area_list, list) {
+   if (!count)
+   break;
+
+   if (!(va->flags & VM_VM_AREA))
+   continue;
+
+   vm = va->vm;
+   vaddr = (char *) vm->addr;
+   if (addr >= vaddr + vm->size - PAGE_SIZE)
continue;
while (addr < vaddr) {
if (count == 0)
@@ -2107,10 +2123,10 @@ long vwrite(char *buf, char *addr, unsigned long count)
addr++;
count--;
}
-   n = vaddr + tmp->size - PAGE_SIZE - addr;
+   n = vaddr + vm->size - PAGE_SIZE - addr;
if (n > count)
n = count;
-   if (!(tmp->flags & VM_IOREMAP)) {
+   if (!(vm->flags & VM_IOREMAP)) {
aligned_vwrite(buf, addr, n);
copied++;
}
@@ -2119,7 +2135,7 @@ long vwrite(char *buf, char *addr, unsigned long count)
count -= n;
}
 finished:
-   read_unlock(&vmlist_lock);
+   spin_unlock(&vmap_area_lock);
if (!copied)
return 0;
return buflen;
-- 
1.7.9.5

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

[PATCH v2 3/8] mm, vmalloc: protect va->vm by vmap_area_lock

2013-03-12 Thread Joonsoo Kim

From: Joonsoo Kim 

Inserting and removing an entry to vmlist is linear time complexity, so
it is inefficient. Following patches will try to remove vmlist entirely.
This patch is preparing step for it.

For removing vmlist, iterating vmlist codes should be changed to iterating
a vmap_area_list. Before implementing that, we should make sure that
when we iterate a vmap_area_list, accessing to va->vm doesn't cause a race
condition. This patch ensure that when iterating a vmap_area_list,
there is no race condition for accessing to vm_struct.

Signed-off-by: Joonsoo Kim 
Signed-off-by: Joonsoo Kim 

diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index 1d9878b..1bf94ad 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -1290,12 +1290,14 @@ struct vm_struct *vmlist;
 static void setup_vmalloc_vm(struct vm_struct *vm, struct vmap_area *va,
  unsigned long flags, const void *caller)
 {
+   spin_lock(&vmap_area_lock);
vm->flags = flags;
vm->addr = (void *)va->va_start;
vm->size = va->va_end - va->va_start;
vm->caller = caller;
va->vm = vm;
va->flags |= VM_VM_AREA;
+   spin_unlock(&vmap_area_lock);
 }
 
 static void insert_vmalloc_vmlist(struct vm_struct *vm)
@@ -1447,6 +1449,11 @@ struct vm_struct *remove_vm_area(const void *addr)
if (va && va->flags & VM_VM_AREA) {
struct vm_struct *vm = va->vm;
 
+   spin_lock(&vmap_area_lock);
+   va->vm = NULL;
+   va->flags &= ~VM_VM_AREA;
+   spin_unlock(&vmap_area_lock);
+
if (!(vm->flags & VM_UNLIST)) {
struct vm_struct *tmp, **p;
/*
-- 
1.7.9.5

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

[PATCH v2 8/8] mm, vmalloc: remove list management of vmlist after initializing vmalloc

2013-03-12 Thread Joonsoo Kim

From: Joonsoo Kim 

Now, there is no need to maintain vmlist after initializing vmalloc.
So remove related code and data structure.

Signed-off-by: Joonsoo Kim 
Signed-off-by: Joonsoo Kim 

diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index 7e63984..151da8a 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -273,10 +273,6 @@ static unsigned long cached_align;
 
 static unsigned long vmap_area_pcpu_hole;
 
-/*** Old vmalloc interfaces ***/
-static DEFINE_RWLOCK(vmlist_lock);
-static struct vm_struct *vmlist;
-
 static struct vmap_area *__find_vmap_area(unsigned long addr)
 {
struct rb_node *n = vmap_area_root.rb_node;
@@ -318,7 +314,7 @@ static void __insert_vmap_area(struct vmap_area *va)
rb_link_node(&va->rb_node, parent, p);
rb_insert_color(&va->rb_node, &vmap_area_root);
 
-   /* address-sort this list so it is usable like the vmlist */
+   /* address-sort this list */
tmp = rb_prev(&va->rb_node);
if (tmp) {
struct vmap_area *prev;
@@ -1130,6 +1126,7 @@ void *vm_map_ram(struct page **pages, unsigned int count, 
int node, pgprot_t pro
 }
 EXPORT_SYMBOL(vm_map_ram);
 
+static struct vm_struct *vmlist __initdata;
 /**
  * vm_area_add_early - add vmap area early during boot
  * @vm: vm_struct to add
@@ -1301,10 +1298,8 @@ static void setup_vmalloc_vm(struct vm_struct *vm, 
struct vmap_area *va,
spin_unlock(&vmap_area_lock);
 }
 
-static void insert_vmalloc_vmlist(struct vm_struct *vm)
+static void clear_vm_unlist(struct vm_struct *vm)
 {
-   struct vm_struct *tmp, **p;
-
/*
 * Before removing VM_UNLIST,
 * we should make sure that vm has proper values.
@@ -1312,22 +1307,13 @@ static void insert_vmalloc_vmlist(struct vm_struct *vm)
 */
smp_wmb();
vm->flags &= ~VM_UNLIST;
-
-   write_lock(&vmlist_lock);
-   for (p = &vmlist; (tmp = *p) != NULL; p = &tmp->next) {
-   if (tmp->addr >= vm->addr)
-   break;
-   }
-   vm->next = *p;
-   *p = vm;
-   write_unlock(&vmlist_lock);
 }
 
 static void insert_vmalloc_vm(struct vm_struct *vm, struct vmap_area *va,
  unsigned long flags, const void *caller)
 {
setup_vmalloc_vm(vm, va, flags, caller);
-   insert_vmalloc_vmlist(vm);
+   clear_vm_unlist(vm);
 }
 
 static struct vm_struct *__get_vm_area_node(unsigned long size,
@@ -1370,10 +1356,9 @@ static struct vm_struct *__get_vm_area_node(unsigned 
long size,
 
/*
 * When this function is called from __vmalloc_node_range,
-* we do not add vm_struct to vmlist here to avoid
-* accessing uninitialized members of vm_struct such as
-* pages and nr_pages fields. They will be set later.
-* To distinguish it from others, we use a VM_UNLIST flag.
+* we add VM_UNLIST flag to avoid accessing uninitialized
+* members of vm_struct such as pages and nr_pages fields.
+* They will be set later.
 */
if (flags & VM_UNLIST)
setup_vmalloc_vm(area, va, flags, caller);
@@ -1462,20 +1447,6 @@ struct vm_struct *remove_vm_area(const void *addr)
va->flags &= ~VM_VM_AREA;
spin_unlock(&vmap_area_lock);
 
-   if (!(vm->flags & VM_UNLIST)) {
-   struct vm_struct *tmp, **p;
-   /*
-* remove from list and disallow access to
-* this vm_struct before unmap. (address range
-* confliction is maintained by vmap.)
-*/
-   write_lock(&vmlist_lock);
-   for (p = &vmlist; (tmp = *p) != vm; p = &tmp->next)
-   ;
-   *p = tmp->next;
-   write_unlock(&vmlist_lock);
-   }
-
vmap_debug_free_range(va->va_start, va->va_end);
free_unmap_vmap_area(va);
vm->size -= PAGE_SIZE;
@@ -1695,10 +1666,11 @@ void *__vmalloc_node_range(unsigned long size, unsigned 
long align,
return NULL;
 
/*
-* In this function, newly allocated vm_struct is not added
-* to vmlist at __get_vm_area_node(). so, it is added here.
+* In this function, newly allocated vm_struct has VM_UNLIST flag.
+* It means that vm_struct is not fully initialized.
+* Now, it is fully initialized, so remove this flag here.
 */
-   insert_vmalloc_vmlist(area);
+   clear_vm_unlist(area);
 
/*
 * A ref_count = 3 is needed because the vm_struct and vmap_area
@@ -2594,7 +2566,7 @@ static void show_numa_info(struct seq_file *m, struct 
vm_struct *v)
if (!counters)
return;
 
-   /* Pair with smp_wm

[PATCH v2 6/8] mm, vmalloc: iterate vmap_area_list, instead of vmlist, in vmallocinfo()

2013-03-12 Thread Joonsoo Kim

From: Joonsoo Kim 

This patch is preparing step for removing vmlist entirely.
For above purpose, we change iterating a vmap_list codes to iterating a
vmap_area_list. It is somewhat trivial change, but just one thing
should be noticed.

Using vmap_area_list in vmallocinfo() introduce ordering problem in SMP
system. In s_show(), we retrieve some values from vm_struct. vm_struct's
values is not fully setup when va->vm is assigned. Full setup is notified
by removing VM_UNLIST flag without holding a lock. When we see that
VM_UNLIST is removed, it is not ensured that vm_struct has proper values
in view of other CPUs. So we need smp_[rw]mb for ensuring that proper
values is assigned when we see that VM_UNLIST is removed.

Therefore, this patch not only change a iteration list, but also add a
appropriate smp_[rw]mb to right places.

Signed-off-by: Joonsoo Kim 
Signed-off-by: Joonsoo Kim 

diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index aee1f61..bda6cef 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -1304,7 +1304,14 @@ static void insert_vmalloc_vmlist(struct vm_struct *vm)
 {
struct vm_struct *tmp, **p;
 
+   /*
+* Before removing VM_UNLIST,
+* we should make sure that vm has proper values.
+* Pair with smp_rmb() in show_numa_info().
+*/
+   smp_wmb();
vm->flags &= ~VM_UNLIST;
+
write_lock(&vmlist_lock);
for (p = &vmlist; (tmp = *p) != NULL; p = &tmp->next) {
if (tmp->addr >= vm->addr)
@@ -2542,19 +2549,19 @@ void pcpu_free_vm_areas(struct vm_struct **vms, int 
nr_vms)
 
 #ifdef CONFIG_PROC_FS
 static void *s_start(struct seq_file *m, loff_t *pos)
-   __acquires(&vmlist_lock)
+   __acquires(&vmap_area_lock)
 {
loff_t n = *pos;
-   struct vm_struct *v;
+   struct vmap_area *va;
 
-   read_lock(&vmlist_lock);
-   v = vmlist;
-   while (n > 0 && v) {
+   spin_lock(&vmap_area_lock);
+   va = list_entry((&vmap_area_list)->next, typeof(*va), list);
+   while (n > 0 && &va->list != &vmap_area_list) {
n--;
-   v = v->next;
+   va = list_entry(va->list.next, typeof(*va), list);
}
-   if (!n)
-   return v;
+   if (!n && &va->list != &vmap_area_list)
+   return va;
 
return NULL;
 
@@ -2562,16 +2569,20 @@ static void *s_start(struct seq_file *m, loff_t *pos)
 
 static void *s_next(struct seq_file *m, void *p, loff_t *pos)
 {
-   struct vm_struct *v = p;
+   struct vmap_area *va = p, *next;
 
++*pos;
-   return v->next;
+   next = list_entry(va->list.next, typeof(*va), list);
+   if (&next->list != &vmap_area_list)
+   return next;
+
+   return NULL;
 }
 
 static void s_stop(struct seq_file *m, void *p)
-   __releases(&vmlist_lock)
+   __releases(&vmap_area_lock)
 {
-   read_unlock(&vmlist_lock);
+   spin_unlock(&vmap_area_lock);
 }
 
 static void show_numa_info(struct seq_file *m, struct vm_struct *v)
@@ -2582,6 +2593,11 @@ static void show_numa_info(struct seq_file *m, struct 
vm_struct *v)
if (!counters)
return;
 
+   /* Pair with smp_wmb() in insert_vmalloc_vmlist() */
+   smp_rmb();
+   if (v->flags & VM_UNLIST)
+   return;
+
memset(counters, 0, nr_node_ids * sizeof(unsigned int));
 
for (nr = 0; nr < v->nr_pages; nr++)
@@ -2595,7 +2611,20 @@ static void show_numa_info(struct seq_file *m, struct 
vm_struct *v)
 
 static int s_show(struct seq_file *m, void *p)
 {
-   struct vm_struct *v = p;
+   struct vmap_area *va = p;
+   struct vm_struct *v;
+
+   if (va->flags & (VM_LAZY_FREE | VM_LAZY_FREEING))
+   return 0;
+
+   if (!(va->flags & VM_VM_AREA)) {
+   seq_printf(m, "0x%pK-0x%pK %7ld vm_map_ram\n",
+   (void *)va->va_start, (void *)va->va_end,
+   va->va_end - va->va_start);
+   return 0;
+   }
+
+   v = va->vm;
 
seq_printf(m, "0x%pK-0x%pK %7ld",
v->addr, v->addr + v->size, v->size);
-- 
1.7.9.5

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

[PATCH v2 1/8] mm, vmalloc: change iterating a vmlist to find_vm_area()

2013-03-12 Thread Joonsoo Kim

From: Joonsoo Kim 

The purpose of iterating a vmlist is finding vm area with specific
virtual address. find_vm_area() is provided for this purpose
and more efficient, because it uses a rbtree.
So change it.

Cc: Chris Metcalf 
Cc: Guan Xuetao 
Cc: Thomas Gleixner 
Cc: Ingo Molnar 
Cc: "H. Peter Anvin" 
Acked-by: Guan Xuetao 
Acked-by: Ingo Molnar 
Acked-by: Chris Metcalf 
Signed-off-by: Joonsoo Kim 
Signed-off-by: Joonsoo Kim 

diff --git a/arch/tile/mm/pgtable.c b/arch/tile/mm/pgtable.c
index b3b4972..dfd63ce 100644
--- a/arch/tile/mm/pgtable.c
+++ b/arch/tile/mm/pgtable.c
@@ -592,12 +592,7 @@ void iounmap(volatile void __iomem *addr_in)
   in parallel. Reuse of the virtual address is prevented by
   leaving it in the global lists until we're done with it.
   cpa takes care of the direct mappings. */
-   read_lock(&vmlist_lock);
-   for (p = vmlist; p; p = p->next) {
-   if (p->addr == addr)
-   break;
-   }
-   read_unlock(&vmlist_lock);
+   p = find_vm_area((void *)addr);
 
if (!p) {
pr_err("iounmap: bad address %p\n", addr);
diff --git a/arch/unicore32/mm/ioremap.c b/arch/unicore32/mm/ioremap.c
index b7a6055..13068ee 100644
--- a/arch/unicore32/mm/ioremap.c
+++ b/arch/unicore32/mm/ioremap.c
@@ -235,7 +235,7 @@ EXPORT_SYMBOL(__uc32_ioremap_cached);
 void __uc32_iounmap(volatile void __iomem *io_addr)
 {
void *addr = (void *)(PAGE_MASK & (unsigned long)io_addr);
-   struct vm_struct **p, *tmp;
+   struct vm_struct *vm;
 
/*
 * If this is a section based mapping we need to handle it
@@ -244,17 +244,10 @@ void __uc32_iounmap(volatile void __iomem *io_addr)
 * all the mappings before the area can be reclaimed
 * by someone else.
 */
-   write_lock(&vmlist_lock);
-   for (p = &vmlist ; (tmp = *p) ; p = &tmp->next) {
-   if ((tmp->flags & VM_IOREMAP) && (tmp->addr == addr)) {
-   if (tmp->flags & VM_UNICORE_SECTION_MAPPING) {
-   unmap_area_sections((unsigned long)tmp->addr,
-   tmp->size);
-   }
-   break;
-   }
-   }
-   write_unlock(&vmlist_lock);
+   vm = find_vm_area(addr);
+   if (vm && (vm->flags & VM_IOREMAP) &&
+   (vm->flags & VM_UNICORE_SECTION_MAPPING))
+   unmap_area_sections((unsigned long)vm->addr, vm->size);
 
vunmap(addr);
 }
diff --git a/arch/x86/mm/ioremap.c b/arch/x86/mm/ioremap.c
index 78fe3f1..9a1e658 100644
--- a/arch/x86/mm/ioremap.c
+++ b/arch/x86/mm/ioremap.c
@@ -282,12 +282,7 @@ void iounmap(volatile void __iomem *addr)
   in parallel. Reuse of the virtual address is prevented by
   leaving it in the global lists until we're done with it.
   cpa takes care of the direct mappings. */
-   read_lock(&vmlist_lock);
-   for (p = vmlist; p; p = p->next) {
-   if (p->addr == (void __force *)addr)
-   break;
-   }
-   read_unlock(&vmlist_lock);
+   p = find_vm_area((void __force *)addr);
 
if (!p) {
printk(KERN_ERR "iounmap: bad address %p\n", addr);
-- 
1.7.9.5

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

[PATCH v2 2/8] mm, vmalloc: move get_vmalloc_info() to vmalloc.c

2013-03-12 Thread Joonsoo Kim

From: Joonsoo Kim 

Now get_vmalloc_info() is in fs/proc/mmu.c. There is no reason
that this code must be here and it's implementation needs vmlist_lock
and it iterate a vmlist which may be internal data structure for vmalloc.

It is preferable that vmlist_lock and vmlist is only used in vmalloc.c
for maintainability. So move the code to vmalloc.c

Signed-off-by: Joonsoo Kim 
Signed-off-by: Joonsoo Kim 

diff --git a/fs/proc/Makefile b/fs/proc/Makefile
index 712f24d..ab30716 100644
--- a/fs/proc/Makefile
+++ b/fs/proc/Makefile
@@ -5,7 +5,7 @@
 obj-y   += proc.o
 
 proc-y := nommu.o task_nommu.o
-proc-$(CONFIG_MMU) := mmu.o task_mmu.o
+proc-$(CONFIG_MMU) := task_mmu.o
 
 proc-y   += inode.o root.o base.o generic.o array.o \
fd.o
diff --git a/fs/proc/internal.h b/fs/proc/internal.h
index 85ff3a4..7571035 100644
--- a/fs/proc/internal.h
+++ b/fs/proc/internal.h
@@ -30,24 +30,6 @@ extern int proc_net_init(void);
 static inline int proc_net_init(void) { return 0; }
 #endif
 
-struct vmalloc_info {
-   unsigned long   used;
-   unsigned long   largest_chunk;
-};
-
-#ifdef CONFIG_MMU
-#define VMALLOC_TOTAL (VMALLOC_END - VMALLOC_START)
-extern void get_vmalloc_info(struct vmalloc_info *vmi);
-#else
-
-#define VMALLOC_TOTAL 0UL
-#define get_vmalloc_info(vmi)  \
-do {   \
-   (vmi)->used = 0;\
-   (vmi)->largest_chunk = 0;   \
-} while(0)
-#endif
-
 extern int proc_tid_stat(struct seq_file *m, struct pid_namespace *ns,
struct pid *pid, struct task_struct *task);
 extern int proc_tgid_stat(struct seq_file *m, struct pid_namespace *ns,
diff --git a/fs/proc/meminfo.c b/fs/proc/meminfo.c
index 1efaaa1..5aa847a 100644
--- a/fs/proc/meminfo.c
+++ b/fs/proc/meminfo.c
@@ -11,6 +11,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include "internal.h"
diff --git a/fs/proc/mmu.c b/fs/proc/mmu.c
deleted file mode 100644
index 8ae221d..000
--- a/fs/proc/mmu.c
+++ /dev/null
@@ -1,60 +0,0 @@
-/* mmu.c: mmu memory info files
- *
- * Copyright (C) 2004 Red Hat, Inc. All Rights Reserved.
- * Written by David Howells (dhowe...@redhat.com)
- *
- * This program is free software; you can redistribute it and/or
- * modify it under the terms of the GNU General Public License
- * as published by the Free Software Foundation; either version
- * 2 of the License, or (at your option) any later version.
- */
-#include 
-#include 
-#include 
-#include 
-#include "internal.h"
-
-void get_vmalloc_info(struct vmalloc_info *vmi)
-{
-   struct vm_struct *vma;
-   unsigned long free_area_size;
-   unsigned long prev_end;
-
-   vmi->used = 0;
-
-   if (!vmlist) {
-   vmi->largest_chunk = VMALLOC_TOTAL;
-   }
-   else {
-   vmi->largest_chunk = 0;
-
-   prev_end = VMALLOC_START;
-
-   read_lock(&vmlist_lock);
-
-   for (vma = vmlist; vma; vma = vma->next) {
-   unsigned long addr = (unsigned long) vma->addr;
-
-   /*
-* Some archs keep another range for modules in vmlist
-*/
-   if (addr < VMALLOC_START)
-   continue;
-   if (addr >= VMALLOC_END)
-   break;
-
-   vmi->used += vma->size;
-
-   free_area_size = addr - prev_end;
-   if (vmi->largest_chunk < free_area_size)
-   vmi->largest_chunk = free_area_size;
-
-   prev_end = vma->size + addr;
-   }
-
-   if (VMALLOC_END - prev_end > vmi->largest_chunk)
-   vmi->largest_chunk = VMALLOC_END - prev_end;
-
-   read_unlock(&vmlist_lock);
-   }
-}
diff --git a/include/linux/vmalloc.h b/include/linux/vmalloc.h
index 6071e91..698b1e5 100644
--- a/include/linux/vmalloc.h
+++ b/include/linux/vmalloc.h
@@ -158,4 +158,22 @@ pcpu_free_vm_areas(struct vm_struct **vms, int nr_vms)
 # endif
 #endif
 
+struct vmalloc_info {
+   unsigned long   used;
+   unsigned long   largest_chunk;
+};
+
+#ifdef CONFIG_MMU
+#define VMALLOC_TOTAL (VMALLOC_END - VMALLOC_START)
+extern void get_vmalloc_info(struct vmalloc_info *vmi);
+#else
+
+#define VMALLOC_TOTAL 0UL
+#define get_vmalloc_info(vmi)  \
+do {   \
+   (vmi)->used = 0;\
+   (vmi)->largest_chunk = 0;   \
+} while (0)
+#endif
+
 #endif /* _LINUX_VMALLOC_H */
diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index 0f751f2..1d9878b 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -2645,5 +2645,49 @@ static int __init proc_vmalloc_init(void)

Re: [PATCH] mm: page_alloc: remove branch operation in free_pages_prepare()

2013-03-13 Thread Joonsoo Kim

Hello, Hugh.

On Thu, Mar 07, 2013 at 06:01:26PM -0800, Hugh Dickins wrote:
> On Fri, 8 Mar 2013, Joonsoo Kim wrote:
> > On Thu, Mar 07, 2013 at 10:54:15AM -0800, Hugh Dickins wrote:
> > > On Thu, 7 Mar 2013, Joonsoo Kim wrote:
> > > 
> > > > When we found that the flag has a bit of PAGE_FLAGS_CHECK_AT_PREP,
> > > > we reset the flag. If we always reset the flag, we can reduce one
> > > > branch operation. So remove it.
> > > > 
> > > > Cc: Hugh Dickins 
> > > > Signed-off-by: Joonsoo Kim 
> > > 
> > > I don't object to this patch.  But certainly I would have written it
> > > that way in order not to dirty a cacheline unnecessarily.  It may be
> > > obvious to you that the cacheline in question is almost always already
> > > dirty, and the branch almost always more expensive.  But I'll leave that
> > > to you, and to those who know more about these subtle costs than I do.
> > 
> > Yes. I already think about that. I thought that even if a cacheline is
> > not dirty at this time, we always touch the 'struct page' in
> > set_freepage_migratetype() a little later, so dirtying is not the problem.
> 
> I expect that a very high proportion of user pages have
> PG_uptodate to be cleared here; and there's also the recently added
> page_nid_reset_last(), which will dirty the flags or a nearby field
> when CONFIG_NUMA_BALANCING.  Those argue in favour of your patch.
> 

Ah... I totally missed it.

> > 
> > But, now, I re-think this and decide to drop this patch.
> > The reason is that 'struct page' of 'compound pages' may not be dirty
> > at this time and will not be dirty at later time.
> 
> Actual compound pages would have PG_head or PG_tail or PG_compound
> to be cleared there, I believe (check if I'm right on that).  The
> questionable case is the ordinary order>0 case without __GFP_COMP
> (and page_nid_reset_last() is applied to each subpage of those).
>

Yes.

> > So this patch is bad idea.
> 
> I'm not so sure.  I doubt your patch will make a giant improvement
> in kernel performance!  But it might make a little - maybe you just
> need to give some numbers from perf to justify it (but I'm easily
> dazzled by numbers - don't expect me to judge the result).

Okay.
Thanks for enlightening comment.
Now, I don't have any idea to collect a performance result for this patch.
When I have more time, I try to think it.

Thanks.

> 
> Hugh
> 
> > 
> > Is there any comments?
> > 
> > Thanks.
> > 
> > > Hugh
> > > 
> > > > 
> > > > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > > > index 8fcced7..778f2a9 100644
> > > > --- a/mm/page_alloc.c
> > > > +++ b/mm/page_alloc.c
> > > > @@ -614,8 +614,7 @@ static inline int free_pages_check(struct page 
> > > > *page)
> > > > return 1;
> > > > }
> > > > page_nid_reset_last(page);
> > > > -   if (page->flags & PAGE_FLAGS_CHECK_AT_PREP)
> > > > -   page->flags &= ~PAGE_FLAGS_CHECK_AT_PREP;
> > > > +   page->flags &= ~PAGE_FLAGS_CHECK_AT_PREP;
> > > > return 0;
> > > >  }
> > > >  
> > > 
> > > --
> > > To unsubscribe, send a message with 'unsubscribe linux-mm' in
> > > the body to majord...@kvack.org.  For more info on Linux MM,
> > > see: http://www.linux-mm.org/ .
> > > Don't email: mailto:"d...@kvack.org";> em...@kvack.org 
> > 
> 
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majord...@kvack.org.  For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: mailto:"d...@kvack.org";> em...@kvack.org 
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH 2/2] slab: move kmem_cache_free to common code

2012-10-23 Thread JoonSoo Kim

2012/10/23 Glauber Costa :
> On 10/23/2012 12:07 PM, Glauber Costa wrote:
>> On 10/23/2012 04:48 AM, JoonSoo Kim wrote:
>>> Hello, Glauber.
>>>
>>> 2012/10/23 Glauber Costa :
>>>> On 10/22/2012 06:45 PM, Christoph Lameter wrote:
>>>>> On Mon, 22 Oct 2012, Glauber Costa wrote:
>>>>>
>>>>>> + * kmem_cache_free - Deallocate an object
>>>>>> + * @cachep: The cache the allocation was from.
>>>>>> + * @objp: The previously allocated object.
>>>>>> + *
>>>>>> + * Free an object which was previously allocated from this
>>>>>> + * cache.
>>>>>> + */
>>>>>> +void kmem_cache_free(struct kmem_cache *s, void *x)
>>>>>> +{
>>>>>> +__kmem_cache_free(s, x);
>>>>>> +trace_kmem_cache_free(_RET_IP_, x);
>>>>>> +}
>>>>>> +EXPORT_SYMBOL(kmem_cache_free);
>>>>>> +
>>>>>
>>>>> This results in an additional indirection if tracing is off. Wonder if
>>>>> there is a performance impact?
>>>>>
>>>> if tracing is on, you mean?
>>>>
>>>> Tracing already incurs overhead, not sure how much a function call would
>>>> add to the tracing overhead.
>>>>
>>>> I would not be concerned with this, but I can measure, if you have any
>>>> specific workload in mind.
>>>
>>> With this patch, kmem_cache_free() invokes __kmem_cache_free(),
>>> that is, it add one more "call instruction" than before.
>>>
>>> I think that Christoph's comment means above fact.
>>
>> Ah, this. Ok, I got fooled by his mention to tracing.
>>
>> I do agree, but since freeing is ultimately dependent on the allocator
>> layout, I don't see a clean way of doing this without dropping tears of
>> sorrow around. The calls in slub/slab/slob would have to be somehow
>> inlined. Hum... maybe it is possible to do it from
>> include/linux/sl*b_def.h...
>>
>> Let me give it a try and see what I can come up with.
>>
>
> Ok.
>
> I am attaching a PoC for this for your appreciation. This gets quite
> ugly, but it's the way I found without including sl{a,u,o}b.c directly -
> which would be even worse.

Hmm...
This is important issue for sl[aou]b common allocators.
Because there are similar functions like as kmem_cache_alloc, ksize, kfree, ...
So it is good time to resolve this issue.

As far as I know, now, we have 3 solutions.

1. include/linux/slab.h
__always_inline kmem_cache_free()
{
__kmem_cache_free();
blablabla...
}

2. define macro like as Glauber's solution
3. include sl[aou]b.c directly.

Is there other good solution?
Among them, I prefer "solution 3", because future developing cost may
be minimum among them.

"Solution 2" may be error-prone for future developing.
"Solution 1" may make compile-time longer and larger code.

Is my understanding right?
Is "Solution 3" really ugly?

Thanks.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH for-v3.7 2/2] slub: optimize kmalloc* inlining for GFP_DMA

2012-10-23 Thread JoonSoo Kim

Hi, Eric.

2012/10/23 Eric Dumazet :
> On Tue, 2012-10-23 at 11:29 +0900, JoonSoo Kim wrote:
>> 2012/10/22 Christoph Lameter :
>> > On Sun, 21 Oct 2012, Joonsoo Kim wrote:
>> >
>> >> kmalloc() and kmalloc_node() of the SLUB isn't inlined when @flags = 
>> >> __GFP_DMA.
>> >> This patch optimize this case,
>> >> so when @flags = __GFP_DMA, it will be inlined into generic code.
>> >
>> > __GFP_DMA is a rarely used flag for kmalloc allocators and so far it was
>> > not considered that it is worth to directly support it in the inlining
>> > code.
>> >
>> >
>>
>> Hmm... but, the SLAB already did that optimization for __GFP_DMA.
>> Almost every kmalloc() is invoked with constant flags value,
>> so I think that overhead from this patch may be negligible.
>> With this patch, code size of vmlinux is reduced slightly.
>
> Only because you asked a allyesconfig
>
> GFP_DMA is used for less than 0.1 % of kmalloc() calls, for legacy
> hardware (from last century)

I'm not doing with allyesconfig,
but localmodconfig on my ubuntu desktop system.
On my system, 700 bytes of text of vmlinux is reduced
which mean there may be more than 100 callsite with GFP_DMA.

> In fact if you want to reduce even more your vmlinux, you could test
>
> if (__builtin_constant_p(flags) && (flags & SLUB_DMA))
> return kmem_cache_alloc_trace(s, flags, size);
>
> to force the call to out of line code.

The reason why I mention about code size is that I want to say it may
be good for performance,
although it has a just small impact.
I'm not interest of reducing code size :)

Thanks for comment.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH v5 05/18] slab/slub: struct memcg_params

2012-10-23 Thread JoonSoo Kim

Hi, Glauber.

2012/10/19 Glauber Costa :
> For the kmem slab controller, we need to record some extra
> information in the kmem_cache structure.
>
> Signed-off-by: Glauber Costa 
> Signed-off-by: Suleiman Souhlal 
> CC: Christoph Lameter 
> CC: Pekka Enberg 
> CC: Michal Hocko 
> CC: Kamezawa Hiroyuki 
> CC: Johannes Weiner 
> CC: Tejun Heo 
> ---
>  include/linux/slab.h | 25 +
>  include/linux/slab_def.h |  3 +++
>  include/linux/slub_def.h |  3 +++
>  mm/slab.h| 13 +
>  4 files changed, 44 insertions(+)
>
> diff --git a/include/linux/slab.h b/include/linux/slab.h
> index 0dd2dfa..e4ea48a 100644
> --- a/include/linux/slab.h
> +++ b/include/linux/slab.h
> @@ -177,6 +177,31 @@ unsigned int kmem_cache_size(struct kmem_cache *);
>  #define ARCH_SLAB_MINALIGN __alignof__(unsigned long long)
>  #endif
>
> +#include 

Why workqueue.h is includede at this time?
It may be future use, so is it better to add it later?
Adding it at proper time makes git blame works better.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH v5 06/18] consider a memcg parameter in kmem_create_cache

2012-10-23 Thread JoonSoo Kim

2012/10/19 Glauber Costa :

> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 6a1e096..59f6d54 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -339,6 +339,12 @@ struct mem_cgroup {
>  #if defined(CONFIG_MEMCG_KMEM) && defined(CONFIG_INET)
> struct tcp_memcontrol tcp_mem;
>  #endif
> +#if defined(CONFIG_MEMCG_KMEM)
> +   /* analogous to slab_common's slab_caches list. per-memcg */
> +   struct list_head memcg_slab_caches;
> +   /* Not a spinlock, we can take a lot of time walking the list */
> +   struct mutex slab_caches_mutex;
> +#endif
>  };

It is better to change name of "slab_caches_mutex to something
representing slab cache mutex of memcg.

> +int memcg_register_cache(struct mem_cgroup *memcg, struct kmem_cache *s)
> +{
> +   size_t size = sizeof(struct memcg_cache_params);
> +
> +   if (!memcg_kmem_enabled())
> +   return 0;
> +
> +   s->memcg_params = kzalloc(size, GFP_KERNEL);
> +   if (!s->memcg_params)
> +   return -ENOMEM;
> +
> +   if (memcg)
> +   s->memcg_params->memcg = memcg;
> +   return 0;
> +}

Now, I don't read full-patchset and I have not enough knowledge about cgroup.
So I have a question.
When memcg_kmem_enable, creation kmem_cache of normal user(not
included in cgroup) also allocate memcg_params.
Is it right behavior?

> +void memcg_release_cache(struct kmem_cache *s)
> +{
> +   kfree(s->memcg_params);
> +}
> +
>  /*
>   * We need to verify if the allocation against current->mm->owner's memcg is
>   * possible for the given order. But the page is not allocated yet, so we'll
> diff --git a/mm/slab.h b/mm/slab.h
> index 5ee1851..c35ecce 100644
> --- a/mm/slab.h
> +++ b/mm/slab.h
> @@ -35,12 +35,15 @@ extern struct kmem_cache *kmem_cache;
>  /* Functions provided by the slab allocators */
>  extern int __kmem_cache_create(struct kmem_cache *, unsigned long flags);
>
> +struct mem_cgroup;
>  #ifdef CONFIG_SLUB
> -struct kmem_cache *__kmem_cache_alias(const char *name, size_t size,
> -   size_t align, unsigned long flags, void (*ctor)(void *));
> +struct kmem_cache *
> +__kmem_cache_alias(struct mem_cgroup *memcg, const char *name, size_t size,
> +  size_t align, unsigned long flags, void (*ctor)(void *));
>  #else
> -static inline struct kmem_cache *__kmem_cache_alias(const char *name, size_t 
> size,
> -   size_t align, unsigned long flags, void (*ctor)(void *))
> +static inline struct kmem_cache *
> +__kmem_cache_alias(struct mem_cgroup *memcg, const char *name, size_t size,
> +  size_t align, unsigned long flags, void (*ctor)(void *))
>  { return NULL; }
>  #endif
>
> @@ -98,11 +101,23 @@ static inline bool is_root_cache(struct kmem_cache *s)
>  {
> return !s->memcg_params || s->memcg_params->is_root_cache;
>  }
> +
> +static inline bool cache_match_memcg(struct kmem_cache *cachep,
> +struct mem_cgroup *memcg)
> +{
> +   return (is_root_cache(cachep) && !memcg) ||
> +   (cachep->memcg_params->memcg == memcg);
> +}

When cachep->memcg_params == NULL and memcg is not NULL, NULL pointer
deref may be occurred.
Is there no situation like above?

>  #else
>  static inline bool is_root_cache(struct kmem_cache *s)
>  {
> return true;
>  }
>
> +static inline bool cache_match_memcg(struct kmem_cache *cachep,
> +struct mem_cgroup *memcg)
> +{
> +   return true;
> +}
>  #endif
>  #endif
> diff --git a/mm/slab_common.c b/mm/slab_common.c
> index bf4b4f1..f97f7b8 100644
> --- a/mm/slab_common.c
> +++ b/mm/slab_common.c
> @@ -18,6 +18,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>
>  #include "slab.h"
>
> @@ -27,7 +28,8 @@ DEFINE_MUTEX(slab_mutex);
>  struct kmem_cache *kmem_cache;
>
>  #ifdef CONFIG_DEBUG_VM
> -static int kmem_cache_sanity_check(const char *name, size_t size)
> +static int kmem_cache_sanity_check(struct mem_cgroup *memcg, const char 
> *name,
> +  size_t size)
>  {
> struct kmem_cache *s = NULL;
>
> @@ -53,7 +55,7 @@ static int kmem_cache_sanity_check(const char *name, size_t 
> size)
> continue;
> }
>
> -   if (!strcmp(s->name, name)) {
> +   if (cache_match_memcg(s, memcg) && !strcmp(s->name, name)) {
> pr_err("%s (%s): Cache name already exists.\n",
>__func__, name);
> dump_stack();
> @@ -66,7 +68,8 @@ static int kmem_cache_sanity_check(const char *name, size_t 
> size)
> return 0;
>  }
>  #else
> -static inline int kmem_cache_sanity_check(const char *name, size_t size)
> +static inline int kmem_cache_sanity_check(struct mem_cgroup *memcg,
> + const char *name, size_t size)
>  {
> return 0;
>  }
> @@ -97,8 +100,9 @@ static inline int kmem_cache_sanity_check(const char 
> *name, s

Re: [PATCH v2 1/2] kmem_cache: include allocators code directly into slab_common

2012-10-24 Thread JoonSoo Kim

2012/10/24 Glauber Costa :
> On 10/24/2012 06:29 PM, Christoph Lameter wrote:
>> On Wed, 24 Oct 2012, Glauber Costa wrote:
>>
>>> Because of that, we either have to move all the entry points to the
>>> mm/slab.h and rely heavily on the pre-processor, or include all .c files
>>> in here.
>>
>> Hmm... That is a bit of a radical solution. The global optimizations now
>> possible with the new gcc compiler include the ability to fold functions
>> across different linkable objects. Andi, is that usable for kernel builds?
>>
>
> In general, it takes quite a lot of time to take all those optimizations
> for granted. We still live a lot of time with multiple compiler versions
> building distros, etc, for quite some time.
>
> I would expect the end result for anyone not using such a compiler to be
> a sudden performance drop when using a new kernel. Not really pleasant.

I agree with Glauber's opinion.
And patch looks fine to me.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH v5 08/18] memcg: infrastructure to match an allocation to the right cache

2012-10-24 Thread JoonSoo Kim

2012/10/19 Glauber Costa :
> @@ -2930,9 +2937,188 @@ int memcg_register_cache(struct mem_cgroup *memcg, 
> struct kmem_cache *s)
>
>  void memcg_release_cache(struct kmem_cache *s)
>  {
> +   struct kmem_cache *root;
> +   int id = memcg_css_id(s->memcg_params->memcg);
> +
> +   if (s->memcg_params->is_root_cache)
> +   goto out;
> +
> +   root = s->memcg_params->root_cache;
> +   root->memcg_params->memcg_caches[id] = NULL;
> +   mem_cgroup_put(s->memcg_params->memcg);
> +out:
> kfree(s->memcg_params);
>  }

memcg_css_id should be called after checking "s->memcg_params->is_root_cache".
Because when is_root_cache == true, memcg_params has no memcg object.


> +/*
> + * This lock protects updaters, not readers. We want readers to be as fast as
> + * they can, and they will either see NULL or a valid cache value. Our model
> + * allow them to see NULL, in which case the root memcg will be selected.
> + *
> + * We need this lock because multiple allocations to the same cache from a 
> non
> + * GFP_WAIT area will span more than one worker. Only one of them can create
> + * the cache.
> + */
> +static DEFINE_MUTEX(memcg_cache_mutex);
> +static struct kmem_cache *memcg_create_kmem_cache(struct mem_cgroup *memcg,
> + struct kmem_cache *cachep)
> +{
> +   struct kmem_cache *new_cachep;
> +   int idx;
> +
> +   BUG_ON(!memcg_can_account_kmem(memcg));
> +
> +   idx = memcg_css_id(memcg);
> +
> +   mutex_lock(&memcg_cache_mutex);
> +   new_cachep = cachep->memcg_params->memcg_caches[idx];
> +   if (new_cachep)
> +   goto out;
> +
> +   new_cachep = kmem_cache_dup(memcg, cachep);
> +
> +   if (new_cachep == NULL) {
> +   new_cachep = cachep;
> +   goto out;
> +   }
> +
> +   mem_cgroup_get(memcg);
> +   cachep->memcg_params->memcg_caches[idx] = new_cachep;
> +   wmb(); /* the readers won't lock, make sure everybody sees it */

Is there any rmb() pair?
As far as I know, without rmb(), wmb() doesn't guarantee anything.

> +   new_cachep->memcg_params->memcg = memcg;
> +   new_cachep->memcg_params->root_cache = cachep;

It may be better these assignment before the statement
"cachep->memcg_params->memcg_caches[idx] = new_cachep".
Otherwise, it may produce race situation.

And assigning value to memcg_params->memcg and root_cache is redundant,
because it is already done in memcg_register_cache().

> +/*
> + * Return the kmem_cache we're supposed to use for a slab allocation.
> + * We try to use the current memcg's version of the cache.
> + *
> + * If the cache does not exist yet, if we are the first user of it,
> + * we either create it immediately, if possible, or create it asynchronously
> + * in a workqueue.
> + * In the latter case, we will let the current allocation go through with
> + * the original cache.
> + *
> + * Can't be called in interrupt context or from kernel threads.
> + * This function needs to be called with rcu_read_lock() held.
> + */
> +struct kmem_cache *__memcg_kmem_get_cache(struct kmem_cache *cachep,
> + gfp_t gfp)
> +{
> +   struct mem_cgroup *memcg;
> +   int idx;
> +
> +   if (cachep->memcg_params && cachep->memcg_params->memcg)
> +   return cachep;

In __memcg_kmem_get_cache, cachep may be always root cache.
So checking "cachep->memcg_params->memcg" is somewhat strange.
Is it right?


Thanks.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH 2/5] mm, highmem: remove useless pool_lock

2012-10-31 Thread JoonSoo Kim

Hello, Andrew.

2012/10/31 Andrew Morton :
> On Mon, 29 Oct 2012 04:12:53 +0900
> Joonsoo Kim  wrote:
>
>> The pool_lock protects the page_address_pool from concurrent access.
>> But, access to the page_address_pool is already protected by kmap_lock.
>> So remove it.
>
> Well, there's a set_page_address() call in mm/page_alloc.c which
> doesn't have lock_kmap().  it doesn't *need* lock_kmap() because it's
> init-time code and we're running single-threaded there.  I hope!
>
> But this exception should be double-checked and mentioned in the
> changelog, please.  And it's a reason why we can't add
> assert_spin_locked(&kmap_lock) to set_page_address(), which is
> unfortunate.

set_page_address() in mm/page_alloc.c is invoked only when
WANT_PAGE_VIRTUAL is defined.
And in this case, set_page_address()'s definition is not in highmem.c,
but in include/linux/mm.h.
So, we don't need to worry about set_page_address() call in mm/page_alloc.c

> The irq-disabling in this code is odd.  If ARCH_NEEDS_KMAP_HIGH_GET=n,
> we didn't need irq-safe locking in set_page_address().  I guess we'll
> need to retain it in page_address() - I expect some callers have IRQs
> disabled.

As Minchan described, if we don't disable irq when we take a lock for pas->lock,
it would be deadlock with page_address().
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

[PATCH v2 3/5] mm, highmem: remove page_address_pool list

2012-10-31 Thread Joonsoo Kim

We can find free page_address_map instance without the page_address_pool.
So remove it.

Cc: Mel Gorman 
Cc: Peter Zijlstra 
Signed-off-by: Joonsoo Kim 
Reviewed-by: Minchan Kim 

diff --git a/mm/highmem.c b/mm/highmem.c
index 017bad1..d98b0a9 100644
--- a/mm/highmem.c
+++ b/mm/highmem.c
@@ -324,10 +324,7 @@ struct page_address_map {
struct list_head list;
 };
 
-/*
- * page_address_map freelist, allocated from page_address_maps.
- */
-static struct list_head page_address_pool; /* freelist */
+static struct page_address_map page_address_maps[LAST_PKMAP];
 
 /*
  * Hash table bucket
@@ -392,12 +389,7 @@ void set_page_address(struct page *page, void *virtual)
 
pas = page_slot(page);
if (virtual) {  /* Add */
-   BUG_ON(list_empty(&page_address_pool));
-
-   pam = list_entry(page_address_pool.next,
-   struct page_address_map, list);
-   list_del(&pam->list);
-
+   pam = &page_address_maps[PKMAP_NR((unsigned long)virtual)];
pam->page = page;
pam->virtual = virtual;
 
@@ -410,7 +402,6 @@ void set_page_address(struct page *page, void *virtual)
if (pam->page == page) {
list_del(&pam->list);
spin_unlock_irqrestore(&pas->lock, flags);
-   list_add_tail(&pam->list, &page_address_pool);
goto done;
}
}
@@ -420,15 +411,10 @@ done:
return;
 }
 
-static struct page_address_map page_address_maps[LAST_PKMAP];
-
 void __init page_address_init(void)
 {
int i;
 
-   INIT_LIST_HEAD(&page_address_pool);
-   for (i = 0; i < ARRAY_SIZE(page_address_maps); i++)
-   list_add(&page_address_maps[i].list, &page_address_pool);
for (i = 0; i < ARRAY_SIZE(page_address_htable); i++) {
INIT_LIST_HEAD(&page_address_htable[i].lh);
spin_lock_init(&page_address_htable[i].lock);
-- 
1.7.9.5

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

[PATCH v2 5/5] mm, highmem: get virtual address of the page using PKMAP_ADDR()

2012-10-31 Thread Joonsoo Kim

In flush_all_zero_pkmaps(), we have an index of the pkmap associated the page.
Using this index, we can simply get virtual address of the page.
So change it.

Cc: Mel Gorman 
Cc: Peter Zijlstra 
Signed-off-by: Joonsoo Kim 
Reviewed-by: Minchan Kim 

diff --git a/mm/highmem.c b/mm/highmem.c
index b365f7b..675ec97 100644
--- a/mm/highmem.c
+++ b/mm/highmem.c
@@ -137,8 +137,7 @@ static unsigned int flush_all_zero_pkmaps(void)
 * So no dangers, even with speculative execution.
 */
page = pte_page(pkmap_page_table[i]);
-   pte_clear(&init_mm, (unsigned long)page_address(page),
- &pkmap_page_table[i]);
+   pte_clear(&init_mm, PKMAP_ADDR(i), &pkmap_page_table[i]);
 
set_page_address(page, NULL);
if (index == PKMAP_INVALID_INDEX)
-- 
1.7.9.5

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

[PATCH v2 0/5] minor clean-up and optimize highmem related code

2012-10-31 Thread Joonsoo Kim

This patchset clean-up and optimize highmem related code.

Change from v1
Rebase on v3.7-rc3
[4] Instead of returning index of last flushed entry, return first index.
And update last_pkmap_nr to this index to optimize more.

Summary for v1
[1] is just clean-up and doesn't introduce any functional change.
[2-3] are for clean-up and optimization.
These eliminate an useless lock opearation and list management.
[4-5] is for optimization related to flush_all_zero_pkmaps().

Joonsoo Kim (5):
  mm, highmem: use PKMAP_NR() to calculate an index of pkmap
  mm, highmem: remove useless pool_lock
  mm, highmem: remove page_address_pool list
  mm, highmem: makes flush_all_zero_pkmaps() return index of first
flushed entry
  mm, highmem: get virtual address of the page using PKMAP_ADDR()

 include/linux/highmem.h |1 +
 mm/highmem.c|  108 ++-
 2 files changed, 51 insertions(+), 58 deletions(-)

-- 
1.7.9.5

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

[PATCH v2 1/5] mm, highmem: use PKMAP_NR() to calculate an index of pkmap

2012-10-31 Thread Joonsoo Kim

To calculate an index of pkmap, using PKMAP_NR() is more understandable
and maintainable, So change it.

Cc: Mel Gorman 
Cc: Peter Zijlstra 
Signed-off-by: Joonsoo Kim 
Reviewed-by: Minchan Kim 

diff --git a/mm/highmem.c b/mm/highmem.c
index d517cd1..b3b3d68 100644
--- a/mm/highmem.c
+++ b/mm/highmem.c
@@ -99,7 +99,7 @@ struct page *kmap_to_page(void *vaddr)
unsigned long addr = (unsigned long)vaddr;
 
if (addr >= PKMAP_ADDR(0) && addr <= PKMAP_ADDR(LAST_PKMAP)) {
-   int i = (addr - PKMAP_ADDR(0)) >> PAGE_SHIFT;
+   int i = PKMAP_NR(addr);
return pte_page(pkmap_page_table[i]);
}
 
-- 
1.7.9.5

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

[PATCH v2 4/5] mm, highmem: makes flush_all_zero_pkmaps() return index of first flushed entry

2012-10-31 Thread Joonsoo Kim

In current code, after flush_all_zero_pkmaps() is invoked,
then re-iterate all pkmaps. It can be optimized if flush_all_zero_pkmaps()
return index of first flushed entry. With this index,
we can immediately map highmem page to virtual address represented by index.
So change return type of flush_all_zero_pkmaps()
and return index of first flushed entry.

Additionally, update last_pkmap_nr to this index.
It is certain that entry which is below this index is occupied by other mapping,
therefore updating last_pkmap_nr to this index is reasonable optimization.

Cc: Mel Gorman 
Cc: Peter Zijlstra 
Cc: Minchan Kim 
Signed-off-by: Joonsoo Kim 

diff --git a/include/linux/highmem.h b/include/linux/highmem.h
index ef788b5..97ad208 100644
--- a/include/linux/highmem.h
+++ b/include/linux/highmem.h
@@ -32,6 +32,7 @@ static inline void invalidate_kernel_vmap_range(void *vaddr, 
int size)
 
 #ifdef CONFIG_HIGHMEM
 #include 
+#define PKMAP_INVALID_INDEX (LAST_PKMAP)
 
 /* declarations for linux/mm/highmem.c */
 unsigned int nr_free_highpages(void);
diff --git a/mm/highmem.c b/mm/highmem.c
index d98b0a9..b365f7b 100644
--- a/mm/highmem.c
+++ b/mm/highmem.c
@@ -106,10 +106,10 @@ struct page *kmap_to_page(void *vaddr)
return virt_to_page(addr);
 }
 
-static void flush_all_zero_pkmaps(void)
+static unsigned int flush_all_zero_pkmaps(void)
 {
int i;
-   int need_flush = 0;
+   unsigned int index = PKMAP_INVALID_INDEX;
 
flush_cache_kmaps();
 
@@ -141,10 +141,13 @@ static void flush_all_zero_pkmaps(void)
  &pkmap_page_table[i]);
 
set_page_address(page, NULL);
-   need_flush = 1;
+   if (index == PKMAP_INVALID_INDEX)
+   index = i;
}
-   if (need_flush)
+   if (index != PKMAP_INVALID_INDEX)
flush_tlb_kernel_range(PKMAP_ADDR(0), PKMAP_ADDR(LAST_PKMAP));
+
+   return index;
 }
 
 /**
@@ -152,14 +155,19 @@ static void flush_all_zero_pkmaps(void)
  */
 void kmap_flush_unused(void)
 {
+   unsigned int index;
+
lock_kmap();
-   flush_all_zero_pkmaps();
+   index = flush_all_zero_pkmaps();
+   if (index != PKMAP_INVALID_INDEX && (index < last_pkmap_nr))
+   last_pkmap_nr = index;
unlock_kmap();
 }
 
 static inline unsigned long map_new_virtual(struct page *page)
 {
unsigned long vaddr;
+   unsigned int index = PKMAP_INVALID_INDEX;
int count;
 
 start:
@@ -168,40 +176,45 @@ start:
for (;;) {
last_pkmap_nr = (last_pkmap_nr + 1) & LAST_PKMAP_MASK;
if (!last_pkmap_nr) {
-   flush_all_zero_pkmaps();
-   count = LAST_PKMAP;
+   index = flush_all_zero_pkmaps();
+   break;
}
-   if (!pkmap_count[last_pkmap_nr])
+   if (!pkmap_count[last_pkmap_nr]) {
+   index = last_pkmap_nr;
break;  /* Found a usable entry */
-   if (--count)
-   continue;
-
-   /*
-* Sleep for somebody else to unmap their entries
-*/
-   {
-   DECLARE_WAITQUEUE(wait, current);
-
-   __set_current_state(TASK_UNINTERRUPTIBLE);
-   add_wait_queue(&pkmap_map_wait, &wait);
-   unlock_kmap();
-   schedule();
-   remove_wait_queue(&pkmap_map_wait, &wait);
-   lock_kmap();
-
-   /* Somebody else might have mapped it while we slept */
-   if (page_address(page))
-   return (unsigned long)page_address(page);
-
-   /* Re-start */
-   goto start;
}
+   if (--count == 0)
+   break;
}
-   vaddr = PKMAP_ADDR(last_pkmap_nr);
+
+   /*
+* Sleep for somebody else to unmap their entries
+*/
+   if (index == PKMAP_INVALID_INDEX) {
+   DECLARE_WAITQUEUE(wait, current);
+
+   __set_current_state(TASK_UNINTERRUPTIBLE);
+   add_wait_queue(&pkmap_map_wait, &wait);
+   unlock_kmap();
+   schedule();
+   remove_wait_queue(&pkmap_map_wait, &wait);
+   lock_kmap();
+
+   /* Somebody else might have mapped it while we slept */
+   vaddr = (unsigned long)page_address(page);
+   if (vaddr)
+   return vaddr;
+
+   /* Re-start */
+   goto start;
+   }
+
+   vaddr = PKMAP_ADDR(index);
set_pte_at(&init_mm, vaddr,
-  &(pkmap_page_table[last_pkmap_nr]), mk_pte(page, kmap_prot));
+  &(pkmap_page_table[index]), mk_

[PATCH v2 2/5] mm, highmem: remove useless pool_lock

2012-10-31 Thread Joonsoo Kim

The pool_lock protects the page_address_pool from concurrent access.
But, access to the page_address_pool is already protected by kmap_lock.
So remove it.

Cc: Mel Gorman 
Cc: Peter Zijlstra 
Signed-off-by: Joonsoo Kim 
Reviewed-by: Minchan Kim 

diff --git a/mm/highmem.c b/mm/highmem.c
index b3b3d68..017bad1 100644
--- a/mm/highmem.c
+++ b/mm/highmem.c
@@ -328,7 +328,6 @@ struct page_address_map {
  * page_address_map freelist, allocated from page_address_maps.
  */
 static struct list_head page_address_pool; /* freelist */
-static spinlock_t pool_lock;   /* protects page_address_pool */
 
 /*
  * Hash table bucket
@@ -395,11 +394,9 @@ void set_page_address(struct page *page, void *virtual)
if (virtual) {  /* Add */
BUG_ON(list_empty(&page_address_pool));
 
-   spin_lock_irqsave(&pool_lock, flags);
pam = list_entry(page_address_pool.next,
struct page_address_map, list);
list_del(&pam->list);
-   spin_unlock_irqrestore(&pool_lock, flags);
 
pam->page = page;
pam->virtual = virtual;
@@ -413,9 +410,7 @@ void set_page_address(struct page *page, void *virtual)
if (pam->page == page) {
list_del(&pam->list);
spin_unlock_irqrestore(&pas->lock, flags);
-   spin_lock_irqsave(&pool_lock, flags);
list_add_tail(&pam->list, &page_address_pool);
-   spin_unlock_irqrestore(&pool_lock, flags);
goto done;
}
}
@@ -438,7 +433,6 @@ void __init page_address_init(void)
INIT_LIST_HEAD(&page_address_htable[i].lh);
spin_lock_init(&page_address_htable[i].lock);
}
-   spin_lock_init(&pool_lock);
 }
 
 #endif /* defined(CONFIG_HIGHMEM) && !defined(WANT_PAGE_VIRTUAL) */
-- 
1.7.9.5

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH 0/5] minor clean-up and optimize highmem related code

2012-10-31 Thread JoonSoo Kim

Hello, Andrew.

2012/10/29 JoonSoo Kim :
> Hi, Minchan.
>
> 2012/10/29 Minchan Kim :
>> Hi Joonsoo,
>>
>> On Mon, Oct 29, 2012 at 04:12:51AM +0900, Joonsoo Kim wrote:
>>> This patchset clean-up and optimize highmem related code.
>>>
>>> [1] is just clean-up and doesn't introduce any functional change.
>>> [2-3] are for clean-up and optimization.
>>> These eliminate an useless lock opearation and list management.
>>> [4-5] is for optimization related to flush_all_zero_pkmaps().
>>>
>>> Joonsoo Kim (5):
>>>   mm, highmem: use PKMAP_NR() to calculate an index of pkmap
>>>   mm, highmem: remove useless pool_lock
>>>   mm, highmem: remove page_address_pool list
>>>   mm, highmem: makes flush_all_zero_pkmaps() return index of last
>>> flushed entry
>>>   mm, highmem: get virtual address of the page using PKMAP_ADDR()
>>
>> This patchset looks awesome to me.
>> If you have a plan to respin, please CCed Peter.
>
> Thanks for review.
> I will wait more review and respin, the day after tomorrow.
> Version 2 will include fix about your comment.

Could you pick up second version of this patchset?

[3] is changed to leave one blank line.
[4] is changed in order to further optimize according to Minchan's comment.
It return first index of flushed entry, instead of last index.
Others doesn't changed.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH v2 4/5] mm, highmem: makes flush_all_zero_pkmaps() return index of first flushed entry

2012-11-02 Thread JoonSoo Kim

Hello, Minchan.

2012/11/1 Minchan Kim :
> On Thu, Nov 01, 2012 at 01:56:36AM +0900, Joonsoo Kim wrote:
>> In current code, after flush_all_zero_pkmaps() is invoked,
>> then re-iterate all pkmaps. It can be optimized if flush_all_zero_pkmaps()
>> return index of first flushed entry. With this index,
>> we can immediately map highmem page to virtual address represented by index.
>> So change return type of flush_all_zero_pkmaps()
>> and return index of first flushed entry.
>>
>> Additionally, update last_pkmap_nr to this index.
>> It is certain that entry which is below this index is occupied by other 
>> mapping,
>> therefore updating last_pkmap_nr to this index is reasonable optimization.
>>
>> Cc: Mel Gorman 
>> Cc: Peter Zijlstra 
>> Cc: Minchan Kim 
>> Signed-off-by: Joonsoo Kim 
>>
>> diff --git a/include/linux/highmem.h b/include/linux/highmem.h
>> index ef788b5..97ad208 100644
>> --- a/include/linux/highmem.h
>> +++ b/include/linux/highmem.h
>> @@ -32,6 +32,7 @@ static inline void invalidate_kernel_vmap_range(void 
>> *vaddr, int size)
>>
>>  #ifdef CONFIG_HIGHMEM
>>  #include 
>> +#define PKMAP_INVALID_INDEX (LAST_PKMAP)
>>
>>  /* declarations for linux/mm/highmem.c */
>>  unsigned int nr_free_highpages(void);
>> diff --git a/mm/highmem.c b/mm/highmem.c
>> index d98b0a9..b365f7b 100644
>> --- a/mm/highmem.c
>> +++ b/mm/highmem.c
>> @@ -106,10 +106,10 @@ struct page *kmap_to_page(void *vaddr)
>>   return virt_to_page(addr);
>>  }
>>
>> -static void flush_all_zero_pkmaps(void)
>> +static unsigned int flush_all_zero_pkmaps(void)
>>  {
>>   int i;
>> - int need_flush = 0;
>> + unsigned int index = PKMAP_INVALID_INDEX;
>>
>>   flush_cache_kmaps();
>>
>> @@ -141,10 +141,13 @@ static void flush_all_zero_pkmaps(void)
>> &pkmap_page_table[i]);
>>
>>   set_page_address(page, NULL);
>> - need_flush = 1;
>> + if (index == PKMAP_INVALID_INDEX)
>> + index = i;
>>   }
>> - if (need_flush)
>> + if (index != PKMAP_INVALID_INDEX)
>>   flush_tlb_kernel_range(PKMAP_ADDR(0), PKMAP_ADDR(LAST_PKMAP));
>> +
>> + return index;
>>  }
>>
>>  /**
>> @@ -152,14 +155,19 @@ static void flush_all_zero_pkmaps(void)
>>   */
>>  void kmap_flush_unused(void)
>>  {
>> + unsigned int index;
>> +
>>   lock_kmap();
>> - flush_all_zero_pkmaps();
>> + index = flush_all_zero_pkmaps();
>> + if (index != PKMAP_INVALID_INDEX && (index < last_pkmap_nr))
>> + last_pkmap_nr = index;
>
> I don't know how kmap_flush_unused is really fast path so how my nitpick
> is effective. Anyway,
> What problem happens if we do following as?
>
> lock()
> index = flush_all_zero_pkmaps();
> if (index != PKMAP_INVALID_INDEX)
> last_pkmap_nr = index;
> unlock();
>
> Normally, last_pkmap_nr is increased with searching empty slot in
> map_new_virtual. So I expect return value of flush_all_zero_pkmaps
> in kmap_flush_unused normally become either less than last_pkmap_nr
> or last_pkmap_nr + 1.

There is a case that return value of kmap_flush_unused() is larger
than last_pkmap_nr.
Look at the following example.

Assume last_pkmap = 20 and index 1-9, 11-19 is kmapped. 10 is kunmapped.

do kmap_flush_unused() => flush index 10 => last_pkmap = 10;
do kunmap() with index 17
do kmap_flush_unused() => flush index 17

So, little dirty implementation is needed.

Thanks.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH v6 00/29] kmem controller for memcg.

2012-11-02 Thread JoonSoo Kim

Hello, Glauber.

2012/11/2 Glauber Costa :
> On 11/02/2012 04:04 AM, Andrew Morton wrote:
>> On Thu,  1 Nov 2012 16:07:16 +0400
>> Glauber Costa  wrote:
>>
>>> Hi,
>>>
>>> This work introduces the kernel memory controller for memcg. Unlike previous
>>> submissions, this includes the whole controller, comprised of slab and stack
>>> memory.
>>
>> I'm in the middle of (re)reading all this.  Meanwhile I'll push it all
>> out to http://ozlabs.org/~akpm/mmots/ for the crazier testers.
>>
>> One thing:
>>
>>> Numbers can be found at https://lkml.org/lkml/2012/9/13/239
>>
>> You claim in the above that the fork worload is 'slab intensive".  Or
>> at least, you seem to - it's a bit fuzzy.
>>
>> But how slab intensive is it, really?
>>
>> What is extremely slab intensive is networking.  The networking guys
>> are very sensitive to slab performance.  If this hasn't already been
>> done, could you please determine what impact this has upon networking?
>> I expect Eric Dumazet, Dave Miller and Tom Herbert could suggest
>> testing approaches.
>>
>
> I can test it, but unfortunately I am unlikely to get to prepare a good
> environment before Barcelona.
>
> I know, however, that Greg Thelen was testing netperf in his setup.
> Greg, do you have any publishable numbers you could share?

Below is my humble opinion.
I am worrying about data cache footprint which is possibly caused by
this patchset, especially slab implementation.
If there are several memcg cgroups, each cgroup has it's own kmem_caches.
When each group do slab-intensive job hard, data cache may be overflowed easily,
and cache miss rate will be high, therefore this would decrease system
performance highly.
Is there any result about this?

Thanks.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

[RESEND PATCH 3/4 v3] mm: fix return value in __alloc_contig_migrate_range()

2012-07-27 Thread Joonsoo Kim

migrate_pages() can return positive value while at the same time emptying
the list of pages it was called with.  Such situation means that it went
through all the pages on the list some of which failed to be migrated.

If that happens, __alloc_contig_migrate_range()'s loop may finish without
"++tries == 5" never being checked.  This in turn means that at the end
of the function, ret may have a positive value, which should be treated
as an error.

This patch changes __alloc_contig_migrate_range() so that the return
statement converts positive ret value into -EBUSY error.

Signed-off-by: Joonsoo Kim 
Cc: Michal Nazarewicz 
Cc: Marek Szyprowski 
Cc: Minchan Kim 
Cc: Christoph Lameter 
Acked-by: Christoph Lameter 
Acked-by: Michal Nazarewicz 

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 4403009..02d4519 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -5673,7 +5673,6 @@ static int __alloc_contig_migrate_range(unsigned long 
start, unsigned long end)
}
tries = 0;
} else if (++tries == 5) {
-   ret = ret < 0 ? ret : -EBUSY;
break;
}
 
@@ -5683,7 +5682,7 @@ static int __alloc_contig_migrate_range(unsigned long 
start, unsigned long end)
}
 
putback_lru_pages(&cc.migratepages);
-   return ret > 0 ? 0 : ret;
+   return ret <= 0 ? ret : -EBUSY;
 }
 
 /*
-- 
1.7.9.5

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

1 2 3 4 5 6 7 8 9 10 >

1 - 100 of 2325 matches

Mail list logo