Hugh Dickins wrote: > I see the discussion has somehow moved on to pagetable locking - > mysterious because the word "lock" never once appears in your > otherwise very helpful elucidation below, for which many thanks. >
It has to be taken with a grain of salt. The reason "lock" isn't mentioned is because I mis-analyzed the situation, and overlooked that page_referenced_one does actually take the pte's lock. > Maybe what I have to add is now of historical interest only, or none, > but I was prevented from answering your original mail earlier... > > On Thu, 4 Oct 2007, Jeremy Fitzhardinge wrote: > > >> David's change 10a8d6ae4b3182d6588a5809a8366343bc295c20, "i386: add >> ptep_test_and_clear_{dirty,young}" has introduced an SMP race which >> affects the Xen pv-ops backend. >> > > I think that race with Xen has been there all along, not introduced > by David's commit. Take a look at ptep_clear_flush_young() in the > 2.6.21 include/asm-i386/pgtable.h, it looks equally a problem to me. > Yes, but I don't think its a problem with correct locking. set_pte_at is a red herring. >> When a pagetable is first created (either in execve or fork), the the >> Xen paravirt backend pins the pagetable, and conversely, on exit it is >> unpinned; this is done via the arch_dup_mmap() and activate_mm() hooks. >> Pinning is done in two phases: first the pagetable pages are marked RO, >> and then the pagetable is registered with Xen; unpinning is the >> > > To my naive mind, your problem actually lies in those two stages: > whatever marks the pages RO should not be keeping Xen in ignorance. > No, its the Xen-specific kernel code which does the RO marking: it marks the pagetables RO (using a hypercall to make the actual modifications), and then tells the hypervisor to pin the whole pagetable. It can't be done atomically, so there's always a window between the two phases. >> It all worked OK before David's change, because asm-generic/pgtable.h >> uses set_pte_at(), which ends up making a hypercall to update the >> pagetable, which always works regardless of the state of the pagetable >> pages. >> > > Except ptep_clear_flush_young() didn't use set_pte_at(). > Yes, my mistake. >> 3. Restructure the pagetable setup code so that the mm is not added >> to the prio tree until after arch_dup_mmap has been called (and >> the converse for exit_mmap). This is arguably cleaner, but I >> haven't looked to see how much trouble this would be. >> > > No. It is intentional that we make those ptes visible as early as > possible, so that concurrent pageout (and less importantly swapoff) > has the best chance of finding all references to a page (or swap ent). > If they only became visible at the final arch_dup_mmap stage, then > it might become impossible to fork a large well-populated mm, if it > contains those very pages which need to be freed to make space to > allocate pagetables for the child. Hm, OK. J - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/