On 10/21/2013 03:21 PM, Maarten Lankhorst wrote:
> op 21-10-13 13:10, Thomas Hellstrom schreef:
>> On 10/21/2013 12:24 PM, Maarten Lankhorst wrote:
>>> op 21-10-13 12:10, Thomas Hellstrom schreef:
>>>> On 10/21/2013 11:48 AM, Maarten Lankhorst wrote:
>>>>> op 21-10-13 11:37, Thomas Hellstrom schreef:
>>>>>> On 10/21/2013 11:01 AM, Maarten Lankhorst wrote:
>>>>>>> op 21-10-13 10:48, Thomas Hellstrom schreef:
>>>>>>>> Hi!
>>>>>>>>
>>>>>>>> As discussed previously, the current locking order in TTM of these
>>>>>>>> locks is bo::reserve -> vm::mmap_sem. This leads to a hack in
>>>>>>>> the TTM fault() handler to try and revert the locking order: if a
>>>>>>>> tryreserve failed, we tried to have the vm code release the mmap_sem
>>>>>>>> and then schedule, to give the holder of bo::reserve a chance to
>>>>>>>> release the lock. This solution is no longer legal, since we've been
>>>>>>>> more or less kindly asked to remove the set_need_resched() call.
>>>>>>>>
>>>>>>>> Maarten has proposed to invert the locking order. I've previously said
>>>>>>>> I had no strong preference. The current locking order dates back to
>>>>>>>> the time when TTM wasn't using unmap_mapping_range() but walked the
>>>>>>>> page tables itself, updating PTEs as needed. Furthermore, it was needed
>>>>>>>> for user bos that used get_user_pages() in the TTM populate and
>>>>>>>> swap-in methods. User bos were removed some time ago, but I'm looking
>>>>>>>> at re-adding them. They would suit the VMware model of cached-only
>>>>>>>> pages very well. I see uses both in the gallium API, XA's DMA
>>>>>>>> functionality and OpenCL.
>>>>>>>>
>>>>>>>> We would then need a somewhat nicer way to invert the locking order.
>>>>>>>> I've attached a solution that ups the mmap_sem and then reserves, but
>>>>>>>> due to how the fault API is done, we then need to release the reserve
>>>>>>>> and retry the fault.
>>>>>>>> This of course opens up for starvation, but I
>>>>>>>> don't think starvation at this point is very likely: one thread being
>>>>>>>> refused to write or read from a buffer object because the GPU is
>>>>>>>> continuously busy with it. If this *would* become a problem, it's
>>>>>>>> probably possible to modify the fault code to allow us to hold locks
>>>>>>>> until the retried fault, but that would be a bit invasive, since it
>>>>>>>> touches the arch code...
>>>>>>>>
>>>>>>>> Basically I'm proposing to keep the current locking order.
>>>>>>> I'm not sure why we have to worry about mmap_sem being taken
>>>>>>> before bo::reserve. If we already hold mmap_sem,
>>>>>>> no extra locking is needed for get_user_pages.
>>>>>> Typically, they are populated outside of fault, as part of execbuf,
>>>>>> where we don't hold and don't want to hold mmap_sem. In fact,
>>>>>> user bos should not be remappable through the TTM VM system. Anyway, we
>>>>>> need to grab the mmap_sem inside ttm_populate for user buffers.
>>>>> If we don't allow mmapping user bos through TTM, we can use special
>>>>> lockdep annotation when user bos are used. Normal bos would have
>>>>> mmap_sem as the outer lock and bo::reserve as the inner lock, while
>>>>> those bos would have it the other way around.
>>>>>
>>>>> This might complicate validation a little, since you would have to
>>>>> reserve and validate all user bos before any normal bos are reserved.
>>>>> But since this is meant to be a vmwgfx-specific optimization, I think
>>>>> it might be worth it.
>>>> Would that work (lockdep-wise) with user bo swapout as part of a normal bo
>>>> validation? During user bo swapout, we don't need mmap_sem, but the bo
>>>> needs to be reserved. But I guess it would work, since we use tryreserve
>>>> when walking the LRU lists?
>>> Correct.
>>>
>>>>>>> Releasing it is a bit silly.
>>>>>>> I think we should keep mmap_sem as the outer
>>>>>>> lock, and have bo::reserve as the inner, even if it might complicate
>>>>>>> support for user bos. I'm not sure what you can do
>>>>>>> with user bos that can't be done by allocating the same bo from the
>>>>>>> kernel first and mapping + populating it.
>>>>>>>
>>>>>>> ~Maarten
>>>>>> Using a DMA API analogy, user bos correspond to streaming DMA,
>>>>>> whereas normal bos correspond to allocated DMA memory buffers.
>>>>>> We boost performance and save resources.
>>>>> Yeah, but it's vmwgfx-specific. Nouveau and radeon have dedicated copy
>>>>> engines that can be used. Flushing the vm and stalling is probably more
>>>>> expensive than performing a memcpy.
>>>> In the end, I'm not sure it will be vmwgfx-specific once there is a
>>>> working example, and there are user-space APIs that will benefit from it.
>>>> There are other examples out there today that use streaming DMA to feed
>>>> the DMA engines, although not through TTM; see for example via_dmablit.c.
>>>>
>>>> Also, if we need a separate locking order for user bos, what would be the
>>>> big benefit of having the locking order mmap_sem() -> bo_reserve()?
>>> Deterministic locking. I fear livelocks that could happen otherwise
>>> with the right timing, especially but not exclusively on -rt kernels.
>> But livelocks wouldn't be an issue anymore since we use a waiting reserve in
>> the retry path, right? The only issue we might theoretically be facing is
>> starvation in the fault path.
>>
>> With a two-step validation scheme I think we're up against real issues, like
>> bouncing ordinary bos in and out of swap during a single execbuf...
> I would need to see some code, but I think it can be avoided by having a
> slowpath similar to i915's if it fails.
There's no slowpath needed, but the following situation might happen. Let's
assume there is a memory shortage:

1. user_bo reference
2. bo reference
3. user_bo validate; bos are swapped out.
4. bo validate; bos are swapped in again.

While it's not an error, it's not a situation we should get into.

>
> Anyway, have 2 threads fault on the same bo at different offsets, so the
> fault handler is called twice on the same bo.
>
> You could end up in a loop: the first one grabbed the bo lock without
> mmap_sem, the second one does a trylock, which fails, then unlocks mmap_sem
> and acquires the bo lock again. At which point the first thread acquires
> mmap_sem and does a trylock. So the same thing can theoretically happen
> infinitely, over and over again.

Indeed. This situation would be extremely unlikely, but it could be easily
fixed with wait_for_unreserved semantics instead of reserve; unreserve. Like
in mm/filemap.c line 666 and onwards. Either by an API addition or by using
the current API; the latter would, however, require an extra mutex per bo.

/Thomas

>
> ~Maarten