CC'ing the EAL hugepage maintainer, which is something you should do
when sending a patch.

On Fri, Feb 05, 2016 at 07:20:24PM +0800, Jianfeng Tan wrote:
> Originally, there are two cons in using hugepages: a. it needs root
> privilege to touch /proc/self/pagemap, which is a premise to
> allocate physically contiguous memsegs; b. possibly too many
> hugepage files are created, especially when used with 2M hugepages.
> 
> Virtual devices don't care about the physical contiguity
> of allocated hugepages at all. The --single-file option
> provides a way to allocate all hugepages in a single
> memory-backed file.
> 
> Known issue:
> a. the single-file option relies on the kernel to allocate
> numa-affinitive memory.
> b. possible ABI break: originally, --no-huge used anonymous memory
> instead of a file-backed way to create memory.
> 
> Signed-off-by: Huawei Xie <huawei.xie at intel.com>
> Signed-off-by: Jianfeng Tan <jianfeng.tan at intel.com>
...
> @@ -956,6 +961,16 @@ eal_check_common_options(struct internal_config *internal_cfg)
>                       "be specified together with --"OPT_NO_HUGE"\n");
>               return -1;
>       }
> +     if (internal_cfg->single_file && internal_cfg->force_sockets == 1) {
> +             RTE_LOG(ERR, EAL, "Option --"OPT_SINGLE_FILE" cannot "
> +                     "be specified together with --"OPT_SOCKET_MEM"\n");
> +             return -1;
> +     }
> +     if (internal_cfg->single_file && internal_cfg->hugepage_unlink) {
> +             RTE_LOG(ERR, EAL, "Option --"OPT_HUGE_UNLINK" cannot "
> +                     "be specified together with --"OPT_SINGLE_FILE"\n");
> +             return -1;
> +     }

The two limitations don't make sense to me.

> diff --git a/lib/librte_eal/linuxapp/eal/eal_memory.c b/lib/librte_eal/linuxapp/eal/eal_memory.c
> index 6008533..68ef49a 100644
> --- a/lib/librte_eal/linuxapp/eal/eal_memory.c
> +++ b/lib/librte_eal/linuxapp/eal/eal_memory.c
> @@ -1102,20 +1102,54 @@ rte_eal_hugepage_init(void)
>       /* get pointer to global configuration */
>       mcfg = rte_eal_get_configuration()->mem_config;
>  
> -     /* hugetlbfs can be disabled */
> -     if (internal_config.no_hugetlbfs) {
> -             addr = mmap(NULL, internal_config.memory, PROT_READ | PROT_WRITE,
> -                             MAP_PRIVATE | MAP_ANONYMOUS, 0, 0);
> +     /* when hugetlbfs is disabled or single-file option is specified */
> +     if (internal_config.no_hugetlbfs || internal_config.single_file) {
> +             int fd;
> +             uint64_t pagesize;
> +             unsigned socket_id = rte_socket_id();
> +             char filepath[MAX_HUGEPAGE_PATH];
> +
> +             if (internal_config.no_hugetlbfs) {
> +                     eal_get_hugefile_path(filepath, sizeof(filepath),
> +                                           "/dev/shm", 0);
> +                     pagesize = RTE_PGSIZE_4K;
> +             } else {
> +                     struct hugepage_info *hpi;
> +
> +                     hpi = &internal_config.hugepage_info[0];
> +                     eal_get_hugefile_path(filepath, sizeof(filepath),
> +                                           hpi->hugedir, 0);
> +                     pagesize = hpi->hugepage_sz;
> +             }
> +             fd = open(filepath, O_CREAT | O_RDWR, S_IRUSR | S_IWUSR);
> +             if (fd < 0) {
> +                     RTE_LOG(ERR, EAL, "%s: open %s failed: %s\n",
> +                             __func__, filepath, strerror(errno));
> +                     return -1;
> +             }
> +
> +             if (ftruncate(fd, internal_config.memory) < 0) {
> +                     RTE_LOG(ERR, EAL, "ftruncate %s failed: %s\n",
> +                             filepath, strerror(errno));
> +                     return -1;
> +             }
> +
> +             addr = mmap(NULL, internal_config.memory,
> +                         PROT_READ | PROT_WRITE,
> +                         MAP_SHARED | MAP_POPULATE, fd, 0);
>               if (addr == MAP_FAILED) {
> -                     RTE_LOG(ERR, EAL, "%s: mmap() failed: %s\n", __func__,
> -                                     strerror(errno));
> +                     RTE_LOG(ERR, EAL, "%s: mmap() failed: %s\n",
> +                             __func__, strerror(errno));
>                       return -1;
>               }
>               mcfg->memseg[0].phys_addr = (phys_addr_t)(uintptr_t)addr;
>               mcfg->memseg[0].addr = addr;
> -             mcfg->memseg[0].hugepage_sz = RTE_PGSIZE_4K;
> +             mcfg->memseg[0].hugepage_sz = pagesize;
>               mcfg->memseg[0].len = internal_config.memory;
> -             mcfg->memseg[0].socket_id = 0;
> +             mcfg->memseg[0].socket_id = socket_id;

I see quite a few issues:

- Assume I have a system with two hugepage sizes: 1G (x4) and 2M (x512),
  mounted at /dev/hugepages and /mnt, respectively.

  We then end up with internal_config.memory being 5G, and your code
  will try to mmap all 5G on the first mount point (/dev/hugepages)
  due to the hardcoded logic in your code:

      hpi = &internal_config.hugepage_info[0];
      eal_get_hugefile_path(filepath, sizeof(filepath),
                      hpi->hugedir, 0);

  But that mount point has only 4G in total; therefore, the mmap will
  fail. (A standalone sketch of this failure follows after this list.)

- As you stated, socket_id is hardcoded, which could be wrong.

- As stated above, the option limitations don't seem right to me.

  I mean, --single-file should be able to work with the --socket-mem
  option semantically.
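
To make the first issue concrete, here is a minimal standalone sketch
of that failure. It is not EAL code: the path and sizes are made-up to
match my example above, and it assumes a 64-bit build run as root with
the 1G mount holding only 4 pages:

    #include <errno.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <unistd.h>

    int main(void)
    {
        /* assumed: /dev/hugepages is a 1G-page mount with 4 pages */
        const char *path = "/dev/hugepages/rtemap_demo";
        size_t len = 5UL << 30;    /* request 5G; only 4G available */

        int fd = open(path, O_CREAT | O_RDWR, 0600);
        if (fd < 0) {
            perror("open");
            return 1;
        }

        /* on hugetlbfs this only sets the file size; no huge pages
         * are allocated yet, so it succeeds */
        if (ftruncate(fd, (off_t)len) < 0) {
            perror("ftruncate");
            return 1;
        }

        /* a MAP_SHARED mapping on hugetlbfs reserves huge pages at
         * mmap() time; the pool holds only 4G, so this fails with
         * ENOMEM */
        void *addr = mmap(NULL, len, PROT_READ | PROT_WRITE,
                          MAP_SHARED | MAP_POPULATE, fd, 0);
        if (addr == MAP_FAILED)
            printf("mmap: %s\n", strerror(errno));

        unlink(path);
        close(fd);
        return 0;
    }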


And I have been thinking about how to deal with those issues properly,
and a __very immature__ solution came to my mind (which could simply
not work), but anyway, here it is, FYI: we run the --single-file
option through the same process used for normal hugepage
initialization, but take different actions, or no action at all, at
some stages when that option is given. This is a bit similar to the
way RTE_EAL_SINGLE_FILE_SEGMENTS is handled.

And we create one hugepage file for each node and each page size. For
a system like mine above (2 nodes), it may populate files like the
following:

- 1G x 2 on node0
- 1G x 2 on node1
- 2M x 256 on node0
- 2M x 256 on node1

That could normally fit your case. Though 4 nodes looks like the
maximum node count, the --socket-mem option may relieve the limit a
bit.
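
Here is a rough sketch of that per-(node, page size) idea, for the 2M
pages on the 2-node box above. The mount point, file names, sizes and
the use of mbind() from libnuma are my assumptions, not existing EAL
code:

    #include <fcntl.h>
    #include <numaif.h>    /* mbind(); link with -lnuma */
    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <unistd.h>

    #define NUM_NODES 2    /* the 2-node box from the example above */

    int main(void)
    {
        /* assumed 2M hugetlbfs mount; 256 x 2M pages per node */
        const char *hugedir = "/mnt";
        size_t per_node = 256UL * 2 * 1024 * 1024;

        for (unsigned node = 0; node < NUM_NODES; node++) {
            char path[256];
            unsigned long nodemask = 1UL << node;
            int fd;
            char *addr;

            snprintf(path, sizeof(path), "%s/rtemap_node%u",
                     hugedir, node);
            fd = open(path, O_CREAT | O_RDWR, 0600);
            if (fd < 0 || ftruncate(fd, (off_t)per_node) < 0) {
                perror(path);
                return 1;
            }

            /* no MAP_POPULATE here: the pages must not be faulted
             * in before the NUMA policy is applied */
            addr = mmap(NULL, per_node, PROT_READ | PROT_WRITE,
                        MAP_SHARED, fd, 0);
            if (addr == MAP_FAILED) {
                perror("mmap");
                return 1;
            }

            /* bind the range to this node, then touch the pages so
             * the kernel allocates them with that affinity */
            if (mbind(addr, per_node, MPOL_BIND,
                      &nodemask, NUM_NODES + 1, 0) < 0) {
                perror("mbind");
                return 1;
            }
            memset(addr, 0, per_node);
            close(fd);
        }
        return 0;
    }

Faulting the pages in only after mbind() is what gets the affinity
right; with MAP_POPULATE, the kernel would allocate them before the
policy took effect.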

And if we "could" not care the socket_id being set correctly, we could
just simply allocate one file for each hugepage size. That would work
well for your container enabling.

BTW, since we already have the SINGLE_FILE_SEGMENTS (config) option,
adding another option named --single-file looks really confusing to me.

To me, maybe you could build on the SINGLE_FILE_SEGMENTS option and
add another option, say --no-sort (I confess this name sucks, but you
get my point). With that, we could make sure to create as few hugepage
files as possible, to fit your case.

        --yliu
