On Tue, Mar 19, 2013 at 11:06:35AM +0800, Xiao Guangrong wrote:
> On 03/19/2013 04:46 AM, Marcelo Tosatti wrote:
> > On Wed, Mar 13, 2013 at 12:59:12PM +0800, Xiao Guangrong wrote:
> >> The current kvm_mmu_zap_all is really slow - it holds mmu-lock while
> >> walking and zapping all shadow pages one by one, and it also needs to
> >> zap every guest page's rmap and every shadow page's parent spte list.
> >> Things become worse as the guest uses more memory or vcpus; it does not
> >> scale well.
> >>
> >> Since all shadow pages will be zapped, we can directly zap the mmu-cache
> >> and rmaps so that vcpus will fault on the new mmu-cache; after that, we
> >> can directly free the memory used by the old mmu-cache.
> >>
> >> The root shadow pages are a little special since they are currently in
> >> use by vcpus, so we can not directly free them. Instead, we zap the root
> >> shadow pages and re-add them to the new mmu-cache.
> >>
> >> After this patch, kvm_mmu_zap_all can be 113% faster than before.
> >>
> >> Signed-off-by: Xiao Guangrong <xiaoguangr...@linux.vnet.ibm.com>
> >> ---
> >>  arch/x86/kvm/mmu.c |   62 ++++++++++++++++++++++++++++++++++++++++++++++-----
> >>  1 files changed, 56 insertions(+), 6 deletions(-)
> >>
> >> diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
> >> index e326099..536d9ce 100644
> >> --- a/arch/x86/kvm/mmu.c
> >> +++ b/arch/x86/kvm/mmu.c
> >> @@ -4186,18 +4186,68 @@ void kvm_mmu_slot_remove_write_access(struct kvm *kvm, int slot)
> >>
> >>  void kvm_mmu_zap_all(struct kvm *kvm)
> >>  {
> >> -  struct kvm_mmu_page *sp, *node;
> >> +  LIST_HEAD(root_mmu_pages);
> >>    LIST_HEAD(invalid_list);
> >> +  struct list_head pte_list_descs;
> >> +  struct kvm_mmu_cache *cache = &kvm->arch.mmu_cache;
> >> +  struct kvm_mmu_page *sp, *node;
> >> +  struct pte_list_desc *desc, *ndesc;
> >> +  int root_sp = 0;
> >>
> >>    spin_lock(&kvm->mmu_lock);
> >> +
> >>  restart:
> >> -  list_for_each_entry_safe(sp, node,
> >> -        &kvm->arch.mmu_cache.active_mmu_pages, link)
> >> -          if (kvm_mmu_prepare_zap_page(kvm, sp, &invalid_list))
> >> -                  goto restart;
> >> +  /*
> >> +   * The root shadow pages are currently in use by vcpus and can not
> >> +   * be directly removed, so we filter them out and re-add them to
> >> +   * the new mmu cache.
> >> +   */
> >> +  list_for_each_entry_safe(sp, node, &cache->active_mmu_pages, link)
> >> +          if (sp->root_count) {
> >> +                  int ret;
> >> +
> >> +                  root_sp++;
> >> +                  ret = kvm_mmu_prepare_zap_page(kvm, sp, &invalid_list);
> >> +                  list_move(&sp->link, &root_mmu_pages);
> >> +                  if (ret)
> >> +                          goto restart;
> >> +          }
> >> +
> >> +  list_splice(&cache->active_mmu_pages, &invalid_list);
> >> +  list_replace(&cache->pte_list_descs, &pte_list_descs);
> >> +
> >> +  /*
> >> +   * Reset the mmu cache so that vcpus will later fault on the new
> >> +   * mmu cache.
> >> +   */
> >> +  memset(cache, 0, sizeof(*cache));
> >> +  kvm_mmu_init(kvm);
> > 
> > Xiao,
> > 
> > I suppose the zeroing of kvm_mmu_cache can be avoided if the links are
> > removed at prepare_zap_page. So perhaps:
> 
> The purpose of zeroing kvm_mmu_cache is to reset the hashtable and
> some of the counters.
> [.n_request_mmu_pages and .n_max_mmu_pages should not be changed; I will
> fix this].
> 
> > 
> > - spin_lock(mmu_lock)
> > - for each page
> >     - zero sp->spt[], remove page from linked lists
> 
> sizeof(mmu_cache) is:
> (1 << 10) * sizeof(hlist_head) + 4 * sizeof(unsigned int) = 2^13 + 16,
> and it is constant. In your way, for every sp we need to zap:
> 512 entries + a hash-node = 2^12 + 8,
> and that workload grows with the size of guest memory.
> Why do you think this way is better?

It's not, of course.

> > - flush remote TLB (batched)
> > - spin_unlock(mmu_lock)
> > - free data (which is safe because freeing has its own serialization)
> 
> We should free the root sps under mmu-lock, as in my patch.
> 
> > - spin_lock(mmu_lock)
> > - account for the pages freed
> > - spin_unlock(mmu_lock)
> 
> The counters are still inconsistent if another thread takes mmu-lock between
> zeroing the shadow pages and re-accounting them.
> 
> Marcelo, I am really confused about what the benefit of this approach is,
> but I might have completely misunderstood it.

I misunderstood the benefit of your idea (now I get it: zapping the roots
and flushing the TLB guarantees vcpus will refault). What I'd like to avoid is

memset(cache, 0, sizeof(*cache));
kvm_mmu_init(kvm);

I'd prefer normal operations on those data structures (in mmu_cache).
The page accounting is also a problem.

Perhaps you can use a generation number to decide whether shadow pages
are still valid? So:

find_sp(gfn_t gfn)
        lookup hash
        if sp->generation_number != mmu->current_generation_number
                initialize page as if it were just allocated
                (but keep it in the hash list)

And on kvm_mmu_zap_all():

spin_lock(mmu_lock)
for each page
        if page->root_count
                zero sp->spt[]
flush TLB
mmu->current_generation_number++
spin_unlock(mmu_lock)

Then have kvm_mmu_free_all() that actually frees all data.
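
To make that a bit more concrete, a rough C sketch on top of the mmu_cache
structure from your patch (completely untested; the generation_number /
current_generation_number fields, sp_is_obsolete() and zap_root_sp_entries()
below do not exist anywhere, they are just placeholders for the idea):

/*
 * Sketch: invalidate shadow pages lazily.  Instead of tearing down the
 * whole mmu cache under mmu_lock, bump a per-VM generation number; a
 * shadow page whose generation is stale is reinitialized in place on
 * the next hash lookup, as if it had just been allocated.
 */
static bool sp_is_obsolete(struct kvm *kvm, struct kvm_mmu_page *sp)
{
	/*
	 * Hypothetical fields on kvm_mmu_page and kvm_mmu_cache; this
	 * check would run in the gfn hash lookup path (for_each_gfn_sp).
	 */
	return sp->generation_number !=
		kvm->arch.mmu_cache.current_generation_number;
}

void kvm_mmu_zap_all(struct kvm *kvm)
{
	struct kvm_mmu_page *sp;

	spin_lock(&kvm->mmu_lock);

	/*
	 * Only the roots need eager work since vcpus are still using them;
	 * zap_root_sp_entries() is a made-up name for "zero sp->spt[]".
	 * Everything else is invalidated by the generation bump and
	 * reinitialized lazily on the next lookup.
	 */
	list_for_each_entry(sp, &kvm->arch.mmu_cache.active_mmu_pages, link)
		if (sp->root_count)
			zap_root_sp_entries(kvm, sp);

	kvm_flush_remote_tlbs(kvm);
	kvm->arch.mmu_cache.current_generation_number++;

	spin_unlock(&kvm->mmu_lock);
}

The freeing and page accounting would then happen in kvm_mmu_free_all()
with normal list operations, outside this path.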

Hmm, not sure if that's any better than your current patchset.
Well, maybe resend the patchset with bug fixes / improvements and
we can go from there.

