Re: [PATCH 08/16] gpu: nova-core: wait for GFW_BOOT completion

Joel Fernandes Mon, 21 Apr 2025 14:46:07 -0700

Hi, Alex,
Just some documentation-type comments and one Rust-naming-convention comment:

On 4/20/2025 8:19 AM, Alexandre Courbot wrote:
> Upon reset, the GPU executes the GFW_BOOT firmware in order to
> initialize its base parameters such as clocks. The driver must ensure
> that this step is completed before using the hardware.
> 
> Signed-off-by: Alexandre Courbot <acour...@nvidia.com>
> ---
>  drivers/gpu/nova-core/devinit.rs   | 40 
> ++++++++++++++++++++++++++++++++++++++
>  drivers/gpu/nova-core/driver.rs    |  2 +-
>  drivers/gpu/nova-core/gpu.rs       |  5 +++++
>  drivers/gpu/nova-core/nova_core.rs |  1 +
>  drivers/gpu/nova-core/regs.rs      | 11 +++++++++++
>  5 files changed, 58 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/gpu/nova-core/devinit.rs 
> b/drivers/gpu/nova-core/devinit.rs
> new file mode 100644
> index 
> 0000000000000000000000000000000000000000..ee5685aff845aa97d6b0fbe9528df9a7ba274b2c
> --- /dev/null
> +++ b/drivers/gpu/nova-core/devinit.rs
> @@ -0,0 +1,40 @@
> +// SPDX-License-Identifier: GPL-2.0
> +
> +//! Methods for device initialization.

Let us clarify what devinit means.

devinit is a sequence of register read/writes after reset that performs tasks
such as:
1. Programming VRAM memory controller timings.
2. Power sequencing.
3. Clock and PLL configuration.
4. Thermal management.
5. Performs VRAM memory scrubbing (ECC initialization) - on some GPUs, it scrubs
only part of memory and then kicks of 'async scrubbing'.

devinit itself is a 'script' which is interpreted by the PMU microcontroller of
of the GPU by an interpreter program.

Note that devinit also needs to run during suspend/resume at runtime.

I talked with Alex and I could add a new patch on top of this patch to add these
clarifying 'doc' comments as well. I will commit them to my git branch and send
on top of this as needed, but Alex can feel free to decide to squash them as 
well.

> +
> +use kernel::bindings;
> +use kernel::devres::Devres;
> +use kernel::prelude::*;
> +
> +use crate::driver::Bar0;
> +use crate::regs;
> +
> +/// Wait for devinit FW completion.
> +///
> +/// Upon reset, the GPU runs some firmware code to setup its core 
> parameters. Most of the GPU is
> +/// considered unusable until this step is completed, so it must be waited 
> on very early during
> +/// driver initialization.
> +pub(crate) fn wait_gfw_boot_completion(bar: &Devres<Bar0>) -> Result<()> {

To reduce acronym soup, we can clarify gfw means 'GPU firmware', it is a broad
term used for VBIOS ROM components several of which execute before the driver
loads. Perhaps that part of comment can be 'the GPU firmware (gfw) code'.

> +    let mut timeout = 2000;
> +
> +    loop {
> +        let gfw_booted = with_bar!(
> +            bar,
> +            |b| regs::Pgc6AonSecureScratchGroup05PrivLevelMask::read(b)

Per my research, FWSEC is run before hand on the GSP in 'high secure' mode,
before the driver even loads. This happens roughly around the time devinit is
also happening (not sure if it before or after). This FWSEC is supposed to lower
the privilege level of the access to 'Pgc6AonSecureScratchGroup05' so that the
register is accessible by the CPU. I think we should mention that here as
rationale for why we need to read Pgc6AonSecureScratchGroup05PrivLevelMask first
before accessing Pgc6AonSecureScratchGroup05.

Here we should say we need to read the GFW_BOOT only once we know that the
privilege level has been reduced by the FWSEC

> +                .read_protection_level0_enabled()
> +                && (regs::Pgc6AonSecureScratchGroup05::read(b).value() & 
> 0xff) == 0xff

I find this Rust convention for camel casing long constants very unreadable and
troubling: Pgc6AonSecureScratchGroup05. I think we should relax this requirement
for sake of readability. Could the Rust community / maintainers provide some 
input?

Apart from readability, it also makes searching for the same register name a
nightmare with other code bases written in C.

Couple of ideas discussed:

1. May be have a macro that converts
REG(NV_PGC6_AON_SECURE_SCRATCH_GROUP_05_PRIV_LEVEL_MASK) ->
regs::Pgc6AonSecureScratchGroup05 ?
But not sure what it takes on the rust side to implement a macro like that.

2. Adding doc comments both in regs.rs during defining the register, and
possibly at the caller site. This still does address the issue fully.

> +        )?;
> +
> +        if gfw_booted {
> +            return Ok(());
> +        }
> +
> +        if timeout == 0 {
> +            return Err(ETIMEDOUT);
> +        }
> +        timeout -= 1;
> +
> +        // SAFETY: msleep should be safe to call with any parameter.
> +        unsafe { bindings::msleep(2) };
> +    }
> +}
> diff --git a/drivers/gpu/nova-core/driver.rs b/drivers/gpu/nova-core/driver.rs
> index 
> a08fb6599267a960f0e07b6efd0e3b6cdc296aa4..752ba4b0fcfe8d835d366570bb2f807840a196da
>  100644
> --- a/drivers/gpu/nova-core/driver.rs
> +++ b/drivers/gpu/nova-core/driver.rs
> @@ -10,7 +10,7 @@ pub(crate) struct NovaCore {
>      pub(crate) gpu: Gpu,
>  }
>  
> -const BAR0_SIZE: usize = 8;
> +const BAR0_SIZE: usize = 0x1000000;
>  pub(crate) type Bar0 = pci::Bar<BAR0_SIZE>;
>  
>  kernel::pci_device_table!(
> diff --git a/drivers/gpu/nova-core/gpu.rs b/drivers/gpu/nova-core/gpu.rs
> index 
> 866c5992b9eb27735975bb4948e522bc01fadaa2..1f7799692a0ab042f2540e01414f5ca347ae9ecc
>  100644
> --- a/drivers/gpu/nova-core/gpu.rs
> +++ b/drivers/gpu/nova-core/gpu.rs
> @@ -2,6 +2,7 @@
>  
>  use kernel::{device, devres::Devres, error::code::*, pci, prelude::*};
>  
> +use crate::devinit;
>  use crate::driver::Bar0;
>  use crate::firmware::Firmware;
>  use crate::regs;
> @@ -168,6 +169,10 @@ pub(crate) fn new(
>              spec.revision
>          );
>  
> +        // We must wait for GFW_BOOT completion before doing any significant 
> setup on the GPU.
> +        devinit::wait_gfw_boot_completion(&bar)
> +            .inspect_err(|_| pr_err!("GFW boot did not complete"))?;
> +
>          Ok(pin_init!(Self { spec, bar, fw }))
>      }
>  }
> diff --git a/drivers/gpu/nova-core/nova_core.rs 
> b/drivers/gpu/nova-core/nova_core.rs
> index 
> 0eecd612e34efc046dad852e6239de6ffa5fdd62..878161e060f54da7738c656f6098936a62dcaa93
>  100644
> --- a/drivers/gpu/nova-core/nova_core.rs
> +++ b/drivers/gpu/nova-core/nova_core.rs
> @@ -20,6 +20,7 @@ macro_rules! with_bar {
>      }
>  }
>  
> +mod devinit;
>  mod driver;
>  mod firmware;
>  mod gpu;
> diff --git a/drivers/gpu/nova-core/regs.rs b/drivers/gpu/nova-core/regs.rs
> index 
> e315a3011660df7f18c0a3e0582b5845545b36e2..fd7096f0ddd4af90114dd1119d9715d2cd3aa2ac
>  100644
> --- a/drivers/gpu/nova-core/regs.rs
> +++ b/drivers/gpu/nova-core/regs.rs
> @@ -13,3 +13,14 @@
>      7:4     major_rev => as u8, "major revision of the chip";
>      28:20   chipset => try_into Chipset, "chipset model"
>  );
> +
> +/* GC6 */

GC6 is a GPU low-power state. The VRAM is in self-refresh and GPU itself is
powered down (all power rails not required for self-refresh).

The following registers are exposed by the hardware unit in the GPU which
manages the GC6 state transitions:

> +
> +register!(Pgc6AonSecureScratchGroup05PrivLevelMask@0x00118128;
> +    0:0     read_protection_level0_enabled => as_bit bool
> +);

This is a privilege-level-mask register, which dictates whether the host CPU can
access the register.

> +
> +/* TODO: This is an array of registers. */
> +register!(Pgc6AonSecureScratchGroup05@0x00118234;
> +    31:0    value => as u32
> +);
> 

These are always-on registers always available including in the GC6 state (which
makes sense since we need to access it to know if we are far enough in the boot
process).

thanks,

 - Joel

Re: [PATCH 08/16] gpu: nova-core: wait for GFW_BOOT completion

Reply via email to