On 2/16/19 4:40 AM, David Hildenbrand wrote:
> On 04.02.19 21:18, Nitesh Narayan Lal wrote:
>
> Hi Nitesh,
>
> I thought again about how s390x handles free page hinting. As that seems
> to work just fine, I guess sticking to a similar model makes sense.
>
>
> I already explained in this thread how it works on s390x, a short summary:
>
> 1. Each VCPU has a buffer of pfns to be reported to the hypervisor. If I
> am not wrong, it contains 512 entries, so is exactly 1 page big. This
> buffer is stored in the hypervisor and is on page granularity.
>
> 2. This page buffer is managed via the ESSA instruction. In addition, to
> synchronize with the guest ("page reused when freeing in the
> hypervisor"), special bits in the host->guest page table can be
> set/locked via the ESSA instruction by the guest and similarly accessed
> by the hypervisor.
>
> 3. Once the buffer is full, the guest does a synchronous hypercall,
> going over all 512 entries and zapping them (== similar to MADV_DONTNEED)
>
>
> To mimic that, we
>
> 1. Have a static buffer per VCPU in the guest with 512 entries. You
> basically have that already.
>
> 2. On every free, add the page _or_ the page after merging by the buddy
> (e.g. MAX_ORDER - 1) to the buffer (this is where we could be better
> than s390x). You basically have that already.
>
> 3. If the buffer is full, try to isolate all pages and do a synchronous
> report to the hypervisor. You have the first part already. The second
> part would require a change (don't use a separate/global thread to do
> the hinting, just do it synchronously).
>
> 4. Once hinting is done, put back all isolated pages to the buddy. You
> basically have that already.
>
>
> For 3. we can try what you have right now, using virtio. If we detect
> that's a problem, we can do it similar to what Alexander proposes and
> just do a bare hypercall. It's just a different way of carrying out the
> same task.
>
>
> This approach
> 1. Mimics what s390x does, besides supporting different granularities.
> To synchronize guest->host we simply take the pages off the buddy.
>
> 2. Is basically what Alexander does, however his design limitation is
> that doing any hinting on smaller granularities will not work because
> there will be too many synchronous hints. Bad on fragmented guests.
>
> 3. Does not require any dynamic data structures in the guest.
>
> 4. Does not block allocation paths.
>
> 5. Blocks on e.g. every 512th free. It seems to work on s390x, so why
> shouldn't it work for us? We have to measure.
>
> 6. We are free to decide which granularity we report.
>
> 7. Potentially works even if the guest memory is fragmented (few
> MAX_ORDER - 1 pages).
>
> It would be worth a try. My feeling is that a synchronous report after
> e.g. 512 frees should be acceptable, as it seems to be acceptable on
> s390x. (basically always enabled, nobody complains).

The reason I like the current approach of reporting via a separate
kernel thread is that it doesn't block any regular allocation/freeing
code path in any way.
>
> We would have to play with how to enable/disable reporting and when to
> not report because it's not worth it in the guest (e.g. low on memory).
>
>
> Do you think something like this would be easy to change/implement and
> measure?

I can do that once I figure out a real-world guest workload with which
the two approaches can be compared.

> Thanks!
>
>> The following patch-set proposes an efficient mechanism for handing freed 
>> memory between the guest and the host. It enables guests with no page 
>> cache to rapidly free and reclaim memory to and from the host respectively.
>>
>> Benefit:
>> With this patch-series, in our test-case, executed on a single system 
>> and single NUMA node with 15GB memory, we were able to successfully 
>> launch at least 5 guests when page hinting was enabled and 3 without it. 
>> (A detailed explanation of the test procedure is provided at the bottom.)
>>
>> Changelog in V8:
>> In this patch-series, the earlier approach [1] which was used to capture and 
>> scan the pages freed by the guest has been changed. The new approach is 
>> briefly described below:
>>
>> The patch-set still leverages the existing arch_free_page() to add this 
>> functionality. It maintains a per-CPU array which is used to store the 
>> pages freed by the guest. The maximum number of entries which it can 
>> hold is defined by MAX_FGPT_ENTRIES (1000). When the array is completely 
>> filled, it is scanned and only the pages which are available in the 
>> buddy are kept. This process continues until the array is filled with 
>> pages which are part of the buddy free list, after which a per-CPU 
>> kernel thread is woken up. This kernel thread rescans the per-CPU array 
>> for any re-allocation, and if a page has not been reallocated and is 
>> still present in the buddy, it attempts to isolate the page from the 
>> buddy. If it is successfully isolated, the page is added to another 
>> per-CPU array. Once the entire scanning process is complete, all the 
>> isolated pages are reported to the host through the existing 
>> virtio-balloon driver.
>>
>> Known Issues:
>>      * Fixed array size: The problem with having a fixed/hardcoded 
>> array size arises when the size of the guest varies. For example, when 
>> the guest size increases and it starts making large allocations, the 
>> fixed size limits this solution's ability to capture all the freed 
>> pages. This results in less guest free memory getting reported to the 
>> host.
>>
>> Known code re-work:
>>      * Plan to re-use Wei's work, which communicates the poison value to the 
>> host.
>>      * The nomenclature used in virtio-balloon needs to be changed so 
>> that the code can easily be distinguished from Wei's Free Page Hint code.
>>      * Sorting based on zonenum, to avoid repetitive zone locks for the same 
>> zone.
>>
>> Other required work:
>>      * Run other benchmarks to evaluate the performance/impact of this 
>> approach.
>>
>> Test case:
>> Setup:
>> Memory-15837 MB
>> Guest Memory Size-5 GB
>> Swap-Disabled
>> Test Program-Simple program which allocates 4GB memory via malloc, touches 
>> it via memset and exits.
>> Use case-Number of guests that can be launched completely including the 
>> successful execution of the test program.
>> Procedure: 
>> The first guest is launched and once its console is up, the test 
>> allocation program is executed with a 4 GB memory request (due to this 
>> the guest occupies almost 4-5 GB of memory in the host in a system 
>> without page hinting). Once this program exits, another guest is 
>> launched in the host and the same process is followed. We continue 
>> launching guests until a guest gets killed due to a low-memory 
>> condition in the host.
>>
>> Result:
>> Without Hinting-3 Guests
>> With Hinting-5 to 7 Guests (based on the amount of memory freed/captured).
>>
>> [1] https://www.spinics.net/lists/kvm/msg170113.html 
>>
>>
>
-- 
Regards
Nitesh
