[AMD Official Use Only]

Daniel 

From the link you shared, it looks like you (or someone else) have quite a few 
patches that change DRM_SCHED or even amdgpu. In that case, I'm wondering 
whether any AMD developer reviewed them before they were merged into the 
kernel tree?

They look to me to conflict with what we changed in our repo.... 

It is really chaotic for us at AMD if someone outside of AMD changes our 
kernel driver (and/or the scheduler) without review by an AMD developer, just 
as we are required to get your review when we intend to change the scheduler's 
logic here .... 

This one changes AMD's code: 
https://lore.kernel.org/dri-devel/20210625133327.2598825-2-boris.brezillon@collabora.com/
And I didn't see any Reviewed-by from an AMD developer ...

This one also touches AMD's code: 
https://lore.kernel.org/dri-devel/20200604081224.863494-12-daniel.vetter@ffwll.ch/
It conflicts with a patch we have submitted (in our repo right now), and I 
don't see an AMD developer's Reviewed-by on this one either (let me know if I 
missed it).

Thanks 

------------------------------------------
Monk Liu | Cloud-GPU Core team
------------------------------------------

-----Original Message-----
From: amd-gfx <amd-gfx-boun...@lists.freedesktop.org> On Behalf Of Daniel Vetter
Sent: Wednesday, September 1, 2021 4:18 PM
To: Liu, Monk <monk....@amd.com>
Cc: Koenig, Christian <christian.koe...@amd.com>; Grodzovsky, Andrey 
<andrey.grodzov...@amd.com>; Chen, JingWen <jingwen.ch...@amd.com>; DRI 
Development <dri-de...@lists.freedesktop.org>; amd-gfx@lists.freedesktop.org
Subject: Re: [diagnostic TDR mode patches] unify our solution 
opinions/suggestions in one thread

Hi Monk,

On Wed, Sep 1, 2021 at 3:23 AM Liu, Monk <monk....@amd.com> wrote:
>
> [AMD Official Use Only]
>
>
> Hi Daniel/Christian/Andrey
>
>
>
> It looks the voice from you three are spread over those email floods to me, 
> the feature we are working on (diagnostic TDR scheme) is pending there for 
> more than 6 month (we started it from feb 2021).

For me your project has existed for a few weeks at most, because that is when 
your team showed up on dri-devel. That you already spent six months on this 
within AMD, on a code area that very much affects shared code, without kicking 
off any thread on dri-devel isn't great, but it's also not something we can fix, 
since time machines don't exist.

So we have to make the best of the situation and move ahead from where we are. 
From my understanding you've done a bunch of changes to the scheduler code. As 
far as I can see, there have been two related things your team has done:

- remove some allocations from scheduler code, because they can lead to 
deadlocks. I kicked off this topic quite a while ago here:

https://lore.kernel.org/dri-devel/20200604081224.863494-10-daniel.vetter@ffwll.ch/

This is just one patch of the entire series. This is an area where we really 
need a consistent solution across all drm/sched drivers, not something that 
individual drivers just fix in their own way.
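To make the deadlock concern concrete, here is a minimal userspace sketch of the pattern in question; `struct job`, `job_init()`, and `job_timeout()` are invented names for illustration, not the actual drm_sched API. The idea is that everything the timeout/recovery path needs gets allocated on the submit path, because by the time recovery runs, memory reclaim may itself be blocked waiting on the GPU, so the recovery path must not call the allocator:

```c
#include <stdlib.h>
#include <string.h>
#include <pthread.h>

/* Hypothetical job structure: recovery_buf is everything the
 * timeout/recovery path will need, allocated up front. */
struct job {
	char *recovery_buf;	/* pre-allocated at submit time */
	pthread_mutex_t lock;
};

/* Submit path: allocation is allowed here, before the job is
 * handed to the scheduler. */
static int job_init(struct job *j, size_t buf_sz)
{
	j->recovery_buf = malloc(buf_sz);
	if (!j->recovery_buf)
		return -1;
	pthread_mutex_init(&j->lock, NULL);
	return 0;
}

/* Timeout/recovery path: must not allocate; it only uses what
 * job_init() already set up. */
static void job_timeout(struct job *j)
{
	pthread_mutex_lock(&j->lock);
	memcpy(j->recovery_buf, "recovered", sizeof("recovered"));
	pthread_mutex_unlock(&j->lock);
}
```

The same shape is what a cross-driver solution would have to enforce for every drm/sched user, which is why fixing it per-driver doesn't work.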

- the other one is the timeout issue for the patches you cite here.
Again, there's been discussion on this on dri-devel with Boris from panfrost 
about how we can handle at least some of the races in TDR.
That resulted in lots of discussion and documentation improvements.
Those patches are merged now:
https://lore.kernel.org/dri-devel/20210625133327.2598825-2-boris.brezillon@collabora.com/

There's been more than just this: also quite a few doc patches from Boris that 
explain how it's all supposed to work and be race-free.
Again, your driver isn't the only one with interesting TDR races.
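For readers outside the thread, the core of the race is that the timeout handler and the scheduler thread can both manipulate job state concurrently. A minimal userspace sketch of the serialization idea, with pthreads standing in for the kernel's kthread park/unpark machinery (all names here are invented; this is not the real drm_sched code):

```c
#include <pthread.h>
#include <stdbool.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t cv = PTHREAD_COND_INITIALIZER;
static bool park, parked, stop;
static int pending_jobs = 3, completed;

/* Scheduler loop: processes jobs unless asked to park. */
static void *sched_loop(void *arg)
{
	(void)arg;
	pthread_mutex_lock(&lock);
	while (!stop) {
		while (park && !stop) {	/* honour a park request */
			parked = true;
			pthread_cond_broadcast(&cv);
			pthread_cond_wait(&cv, &lock);
		}
		parked = false;
		if (stop)
			break;
		if (pending_jobs > 0) {
			pending_jobs--;	/* "run" one job */
			completed++;
		} else {
			pthread_cond_wait(&cv, &lock);
		}
	}
	pthread_mutex_unlock(&lock);
	return NULL;
}

/* Timeout handler: park the scheduler before touching shared
 * job state, so the two paths can never race on it. */
static void handle_timeout(void)
{
	pthread_mutex_lock(&lock);
	park = true;
	pthread_cond_broadcast(&cv);
	while (!parked)
		pthread_cond_wait(&cv, &lock);
	/* scheduler is quiescent: safe to reset job state here */
	pending_jobs = 0;
	park = false;
	pthread_cond_broadcast(&cv);
	pthread_mutex_unlock(&lock);
}
```

This only illustrates one of the orderings the dri-devel discussion covered; the documentation patches linked above describe the full set of guarantees.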

Your team hasn't been active in any of these discussions, but now suddenly pops 
up out of nowhere and demands that your approach land ASAP. That's really not 
how upstream works.

The other thing where I'm struggling is that there's a lot of missing context 
for outsiders. The patches sometimes come with zero commit message, for tricky 
concurrency bugs. And there's no context with what you've done already on the 
amdgpu side (since that never showed up on dri-devel), which makes constructive 
discussions here really hard.

Now fixing these bugs is obviously good, but the way this is supposed to work 
when touching shared infrastructure is:

- Before you start merging anything, kick off an RFC thread on dri-devel (or 
wherever the topic really belongs) about the problem you have and how you're 
trying to solve it. This can be just text if it's a big thing, but it can also 
already include a proof-of-concept solution in the form of patches.

- Then we iterate on the solution, across drivers and shared code, _together_. 
Not "merge the amdgpu code first, then get annoyed when the core changes don't 
land immediately after you've practically finished the project".

- This might mean changes to other drivers if we need to adjust interfaces.

On the plus side you can plan much better, because you know you have upstream 
buy-in before you start to put in real work on the project.

> Honestly speaking, the email approach we are using now is not friendly and 
> is quite painful for me ....

Yes this is painful :-(

I think the best way forward is to go through the above process again and 
essentially restart. So submit a complete patch series with problem 
descriptions, the solution you picked, why you picked it, all the amdgpu 
patches to get there, and the core patches too. Since it sounds like a bunch of 
this has already landed, you probably need a patch 1 that goes back to six 
months ago, so that we can see the overall direction and review whether it's 
the right one or not.

The not-so-painful approach would have been to do this from the start,
six months ago. It would definitely have helped if the TDR discussion we had 
just a few months ago had involved your team too; I'm sure there would have 
been some good insights from AMD's side. I'd really like you and your engineers 
involved here, so let's do this properly!

Cheers, Daniel

> Can we try to put all our opinions, suggestions, or even objections here 
> together, and go through them one by one? It's too hard for us to reply to 
> each email on different questions.
>
>
>
> For [PATCH 1/2] drm/sched: fix the bug of time out calculation(v4)
>
>
>
> This patch fixes the timeout timer in the scheduler. Can we complete this 
> one first? It should have already resolved all the questions and suggestions.
>
>
>
> For [PATCH 2/2] drm/sched: serialize job_timeout and scheduler
>
>
>
> I think I already explained the questions raised by Daniel in the other 
> thread, regarding why I use __kthread_should_park().
>
> For the other aspects, can we synthesize all our opinions here?
>
>
>
> Thanks !
>
>
>
> ------------------------------------------
>
> Monk Liu | Cloud-GPU Core team
>
> ------------------------------------------
>
>



--
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch/
