On 2/16/19 4:40 AM, David Hildenbrand wrote:
> On 04.02.19 21:18, Nitesh Narayan Lal wrote:
>
> Hi Nitesh,
>
> I thought again about how s390x handles free page hinting. As that seems
> to work just fine, I guess sticking to a similar model makes sense.
>
> I already explained in this thread how it works on s390x; a short summary:
>
> 1. Each VCPU has a buffer of pfns to be reported to the hypervisor. If I
>    am not wrong, it contains 512 entries, so it is exactly one page big.
>    This buffer is stored in the hypervisor and is on page granularity.
>
> 2. This page buffer is managed via the ESSA instruction. In addition, to
>    synchronize with the guest ("page reused when freeing in the
>    hypervisor"), special bits in the host->guest page table can be
>    set/locked via the ESSA instruction by the guest and similarly
>    accessed by the hypervisor.
>
> 3. Once the buffer is full, the guest does a synchronous hypercall,
>    going over all 512 entries and zapping them (== similar to
>    MADV_DONTNEED).
>
> To mimic that, we
>
> 1. Have a static buffer per VCPU in the guest with 512 entries. You
>    basically have that already.
>
> 2. On every free, add the page _or_ the page after merging by the buddy
>    (e.g. MAX_ORDER - 1) to the buffer (this is where we could be better
>    than s390x). You basically have that already.
>
> 3. If the buffer is full, try to isolate all pages and do a synchronous
>    report to the hypervisor. You have the first part already. The second
>    part would require a change (don't use a separate/global thread to do
>    the hinting, just do it synchronously).
>
> 4. Once hinting is done, put back all isolated pages to the buddy. You
>    basically have that already.
>
> For 3. we can try what you have right now, using virtio. If we detect
> that's a problem, we can do it similar to what Alexander proposes and
> just do a bare hypercall. It's just a different way of carrying out the
> same task.
>
> This approach
> 1. Mimics what s390x does, besides supporting different granularities.
>    To synchronize guest->host we simply take the pages off the buddy.
>
> 2. Is basically what Alexander does; however, his design limitation is
>    that doing any hinting on smaller granularities will not work because
>    there will be too many synchronous hints. Bad on fragmented guests.
>
> 3. Does not require any dynamic data structures in the guest.
>
> 4. Does not block allocation paths.
>
> 5. Blocks on e.g. every 512th free. It seems to work on s390x, so why
>    shouldn't it for us? We have to measure.
>
> 6. We are free to decide which granularity we report.
>
> 7. Potentially works even if the guest memory is fragmented (few
>    MAX_ORDER - 1 pages).
>
> It would be worth a try. My feeling is that a synchronous report after
> e.g. 512 frees should be acceptable, as it seems to be acceptable on
> s390x (basically always enabled, nobody complains).
The reason I like the current approach of reporting via a separate kernel
thread is that it doesn't block any regular allocation/freeing code path
in any way.

> We would have to play with how to enable/disable reporting and when to
> not report because it's not worth it in the guest (e.g. low on memory).
>
> Do you think something like this would be easy to change/implement and
> measure?

I can do that as I figure out a real-world guest workload using which the
two approaches can be compared.

> Thanks!
>
>> The following patch-set proposes an efficient mechanism for handing
>> freed memory between the guest and the host. It enables guests with no
>> page cache to rapidly free and reclaim memory to and from the host,
>> respectively.
>>
>> Benefit:
>> With this patch-series, in our test-case, executed on a single system
>> and single NUMA node with 15GB memory, we were able to successfully
>> launch at least 5 guests when page hinting was enabled and 3 without
>> it. (A detailed explanation of the test procedure is provided at the
>> bottom.)
>>
>> Changelog in V8:
>> In this patch-series, the earlier approach [1] which was used to
>> capture and scan the pages freed by the guest has been changed. The new
>> approach is briefly described below:
>>
>> The patch-set still leverages the existing arch_free_page() to add this
>> functionality. It maintains a per-CPU array which is used to store the
>> pages freed by the guest. The maximum number of entries which it can
>> hold is defined by MAX_FGPT_ENTRIES (1000). When the array is
>> completely filled, it is scanned and only the pages which are available
>> in the buddy are stored. This process continues until the array is
>> filled with pages which are part of the buddy free list. After that, it
>> wakes up a per-CPU kernel thread.
>> This per-CPU kernel thread rescans the per-CPU array for any
>> re-allocation, and if a page has not been reallocated and is present in
>> the buddy, the kernel thread attempts to isolate it from the buddy. If
>> it is successfully isolated, the page is added to another per-CPU
>> array. Once the entire scanning process is complete, all the isolated
>> pages are reported to the host through the existing virtio-balloon
>> driver.
>>
>> Known Issues:
>> * Fixed array size: The problem with having a fixed/hardcoded array
>>   size arises when the size of the guest varies. For example, when the
>>   guest size increases and it starts making large allocations, the
>>   fixed size limits this solution's ability to capture all the freed
>>   pages. This will result in less guest free memory getting reported to
>>   the host.
>>
>> Known code re-work:
>> * Plan to re-use Wei's work, which communicates the poison value to the
>>   host.
>> * The nomenclature used in virtio-balloon needs to be changed so that
>>   the code can easily be distinguished from Wei's Free Page Hint code.
>> * Sorting based on zonenum, to avoid repetitive zone locks for the same
>>   zone.
>>
>> Other required work:
>> * Run other benchmarks to evaluate the performance/impact of this
>>   approach.
>>
>> Test case:
>> Setup:
>> Memory: 15837 MB
>> Guest memory size: 5 GB
>> Swap: disabled
>> Test program: a simple program which allocates 4GB memory via malloc,
>> touches it via memset, and exits.
>> Use case: the number of guests that can be launched completely,
>> including successful execution of the test program.
>> Procedure:
>> The first guest is launched, and once its console is up, the test
>> allocation program is executed with a 4 GB memory request (due to this,
>> the guest occupies almost 4-5 GB of memory in the host in a system
>> without page hinting). Once this program exits, another guest is
>> launched in the host and the same process is followed.
>> We continue launching guests until a guest gets killed due to a
>> low-memory condition in the host.
>>
>> Result:
>> Without hinting: 3 guests
>> With hinting: 5 to 7 guests (based on the amount of memory
>> freed/captured)
>>
>> [1] https://www.spinics.net/lists/kvm/msg170113.html

--
Regards
Nitesh