On Mon, Nov 18, 2019 at 02:42:42PM +0300, Roman Bolshakov wrote:
> On Mon, Nov 18, 2019 at 01:02:00PM +1100, Daniel Axtens wrote:
> > Hi Roman,
> >
> > > We're running a lot of KVM virtual machines on POWER8 hosts and
> > > sometimes new VMs can't be started because there are no contiguous
> > > regions for the HPT, due to CMA region fragmentation.
> > >
> > > The issue is covered in the LWN article: https://lwn.net/Articles/684611/
> > > The article notes that you raised the problem at LSFMM 2016, but I
> > > couldn't find a follow-up article on the issue.
> > >
> > > Looking at the kernel commit log, I've identified a few commits that
> > > might reduce CMA fragmentation and overcome the HPT allocation failure:
> > > - bd2e75633c801 ("dma-contiguous: use fallback alloc_pages for single
> > >   pages")
> > > - 678e174c4c16a ("powerpc/mm/iommu: allow migration of cma allocated
> > >   pages during mm_iommu_do_alloc")
> > > - 9a4e9f3b2d739 ("mm: update get_user_pages_longterm to migrate pages
> > >   allocated from CMA region")
> > > - d7fefcc8de914 ("mm/cma: add PF flag to force non cma alloc")
> > >
> > > Are there any other commits that address the issue? What is the first
> > > kernel version that shouldn't have the HPT allocation problem due to
> > > CMA fragmentation?
> >
> > I've had some success increasing the CMA allocation with the
> > kvm_cma_resv_ratio boot parameter - see
> > arch/powerpc/kvm/book3s_hv_builtin.c
> >
> > The default is 5%. In a support case in a former job we had a customer
> > who increased this to, I think, 7 or 8% and saw the symptoms subside
> > dramatically.
>
> Hi Daniel,
>
> Thank you, I'll try increasing kvm_cma_resv_ratio for now, but even a 5%
> CMA reserve should be more than enough, given that the HPT size is
> 1/128th of a VM's maximum memory.
>
> For a 16GB RAM VM without a balloon device, only 128MB is reserved for
> the HPT from CMA. So a 5% CMA reserve should allow provisioning VMs with
> over 1.5TB of RAM on a 256GB host. In other words, the default CMA
> reserve allows overprovisioning roughly six times more VM memory than is
> present on the host.
>
> We rarely add a balloon device and sometimes don't add one at all, so
> I'm still looking for the commits that would help avoid the issue with
> the default CMA reserve.
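
To spell out the sizing arithmetic above, here's a quick
back-of-the-envelope sketch in Python (the 256GB host, the 5%
kvm_cma_resv_ratio default and the 1/128 HPT ratio are the example
numbers from this thread, not measurements):

GiB = 1024 ** 3

host_ram     = 256 * GiB  # example host size from this thread
cma_ratio    = 0.05       # kvm_cma_resv_ratio default (5%)
hpt_fraction = 128        # HPT is ~1/128th of VM maximum memory

cma_reserve   = host_ram * cma_ratio        # memory set aside for CMA
vm_ram_backed = cma_reserve * hpt_fraction  # VM RAM those HPTs can cover

print(cma_reserve / GiB)         # 12.8 GiB CMA reserve
print(vm_ram_backed / GiB)       # 1638.4 GiB, i.e. ~1.6TB of VM RAM
print(vm_ram_backed / host_ram)  # 6.4x overprovisioning factor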
FWIW, I have noticed the following. My host has 4 NUMA nodes with 4 CPUs
per node; only one of the nodes has CMA pages, and only two of the nodes
have memory, according to /proc/zoneinfo. The error can be reliably
reproduced if I attempt to place vCPUs on the node with CMA pages.
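
In case it's useful, here's roughly how I summarize the per-node layout
(a small Python sketch over /proc/zoneinfo; it just sums the present and
nr_free_cma counters per node, assuming your kernel exposes those field
names):

#!/usr/bin/env python3
import re
from collections import defaultdict

present = defaultdict(int)   # pages present per NUMA node
free_cma = defaultdict(int)  # free CMA pages per NUMA node
node = None

with open("/proc/zoneinfo") as f:
    for line in f:
        m = re.match(r"Node (\d+), zone", line)
        if m:
            node = int(m.group(1))
        elif node is not None:
            m = re.match(r"\s+present\s+(\d+)", line)
            if m:
                present[node] += int(m.group(1))
            m = re.match(r"\s+nr_free_cma\s+(\d+)", line)
            if m:
                free_cma[node] += int(m.group(1))

for n in sorted(present):
    print(f"node {n}: present={present[n]} pages, "
          f"free_cma={free_cma[n]} pages")

Roman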