Am 24.01.23 um 09:27 schrieb Marek Olšák:
A new Gallium HUD "value producer" could be added that reads fdinfo without calling the driver.

That sounds good. To be honest I would have plenty of use for that.

I still think there is merit in having this in amdgpu_drm.h too.

Why?

Christian.


Marek

On Tue, Jan 24, 2023 at 3:13 AM Marek Olšák <mar...@gmail.com> wrote:

    The table of exposed driver-specific counters:
    
https://gitlab.freedesktop.org/mesa/mesa/-/blob/main/src/gallium/drivers/radeonsi/si_query.c#L1751

    Counter enums. They use the same interface as e.g. occlusion
    queries, except that begin_query and end_query save the results in
    the driver/CPU.
    
https://gitlab.freedesktop.org/mesa/mesa/-/blob/main/src/gallium/drivers/radeonsi/si_query.h#L45

    Counters exposed by the winsys:
    
https://gitlab.freedesktop.org/mesa/mesa/-/blob/main/src/gallium/include/winsys/radeon_winsys.h#L126

    I just need to query the counters in the winsys and return them.

    Marek

    On Tue, Jan 24, 2023 at 2:58 AM Christian König
    <ckoenig.leichtzumer...@gmail.com> wrote:

        How are the counters which the HUD consumes declared?

        See what I want to avoid is a) to nail down the interface with
        the kernel on specific values and b) make it possible to
        easily expose new values.

        In other words what we could do with fdinfo is to have
        something like this:

        GALLIUM_FDINFO_HUD=drm-memory-vram,amd-evicted-vram,amd-mclk
        glxgears

        And the HUD just displays the values the kernel provides
        without the need to re-compile mesa when we want to add some
        more values nor have the values as part of the UAPI.

        Christian.

        Am 24.01.23 um 08:37 schrieb Marek Olšák:
        The Gallium HUD doesn't consume strings. It only consumes
        values that are exposed as counters from the driver. In this
        case, we need the driver to expose evicted stats as counters.
        Each counter can set whether the value is absolute (e.g.
        memory usage) or monotonic (e.g. perf counter). Parsing
        fdinfo to get the values is undesirable.

        Marek

        On Mon, Jan 23, 2023 at 4:31 AM Christian König
        <ckoenig.leichtzumer...@gmail.com> wrote:

            Let's do this as valid in fdinfo.

            This way we can easily extend whatever the kernel wants
            to display as statistics in the userspace HUD.

            Regards,
            Christian.

            Am 21.01.23 um 01:45 schrieb Marek Olšák:
            We badly need a way to query evicted memory usage. It's
            essential for investigating performance problems and it
            uncovered the buddy allocator disaster. Please either
            suggest an alternative, suggest changes, or review. We
            need it ASAP.

            Thanks,
            Marek

            On Tue, Jan 10, 2023 at 11:55 AM Marek Olšák
            <mar...@gmail.com> wrote:

                On Tue, Jan 10, 2023 at 11:23 AM Christian König
                <ckoenig.leichtzumer...@gmail.com> wrote:

                    Am 10.01.23 um 16:28 schrieb Marek Olšák:
                    On Wed, Jan 4, 2023 at 9:51 AM Christian König
                    <ckoenig.leichtzumer...@gmail.com> wrote:

                        Am 04.01.23 um 00:08 schrieb Marek Olšák:
                        I see about the access now, but did you
                        even look at the patch?

                        I did look at the patch, but I haven't
                        fully understood yet what you are trying to
                        do here.


                    First and foremost, it returns the evicted size
                    of VRAM and visible VRAM, and returns visible
                    VRAM usage. It should be obvious which stat
                    includes the size of another.


                        Because what the patch does isn't even
                        exposed to common drm code, such as the
                        preferred domain and visible VRAM
                        placement, so it can't be in fdinfo right now.

                        Or do you even know what fdinfo contains?
                        Because it contains nothing useful. It
                        only has VRAM and GTT usage, which we
                        already have in the INFO ioctl, so it has
                        nothing that we need. We mainly need the
                        eviction information and visible VRAM
                        information now. Everything else is a bonus.

                        Well the main question is what are you
                        trying to get from that information? The
                        eviction list for example is completely
                        meaningless to userspace, that stuff is
                        only temporary and will be cleared on the
                        next CS again.


                    I don't know what you mean. The returned
                    eviction stats look correct and are stable
                    (they don't change much). You can suggest
                    changes if you think some numbers are not
                    reported correctly.


                        What we could expose is the VRAM
                        over-commit value, e.g. how much BOs which
                        where supposed to be in VRAM are in GTT
                        now. I think that's what you are looking
                        for here, right?


                    The VRAM overcommit value is "evicted_vram".


                        Also, it's undesirable to open and parse a
                        text file if we can just call an ioctl.

                        Well I see the reasoning for that, but I
                        also see why other drivers do a lot of the
                        stuff we have as IOCTL as separate files in
                        sysfs, fdinfo or debugfs.

                        Especially repeating all the static
                        information which were already available
                        under sysfs in the INFO IOCTL was a design
                        mistake as far as I can see. Just compare
                        what AMDGPU and the KFD code is doing to
                        what for example i915 is doing.

                        Same for things like debug information
                        about a process. The fdinfo stuff can be
                        queried from external tools (gdb, gputop,
                        umr etc...) as well which makes that
                        interface more preferred.


                    Nothing uses fdinfo in Mesa. No driver uses
                    sysfs in Mesa except drm shims, noop drivers,
                    and Intel for perf metrics. sysfs itself is an
                    unusable mess for the PCIe query and is missing
                    information.

                    I'm not against exposing more stuff through
                    sysfs and fdinfo for tools, but I don't see any
                    reason why drivers should use it (other than
                    for slowing down queries and initialization).

                    That's what I'm asking: Is this for some tool or
                    to make some driver decision based on it?

                    If you just want the numbers for over displaying
                    then I think it would be better to put this into
                    fdinfo together with the other existing stuff there.


                    If you want to make allocation decisions based
                    on this then we should have that as IOCTL or
                    even better as mmap() page between kernel and
                    userspace. But in this case I would also
                    calculation the numbers completely different as
                    well.

                    See we have at least the following things in the
                    kernel:
                    1. The eviction list in the VM.
                        Those are the BOs which are currently
                    evicted and tried to moved back in on the next CS.

                    2. The VRAM over commit value.
                        In other words how much more VRAM than
                    available has the application tried to allocate?

                    3. The visible VRAM usage by this application.

                    The end goal is that the eviction list will go
                    away, e.g. we will always have stable
                    allocations based on allocations of other
                    applications and not constantly swap things in
                    and out.

                    When you now expose the eviction list to
                    userspace we will be stuck with this interface
                    forever.


                It's for the GALLIUM HUD.

                The only missing thing is the size of all evicted
                VRAM allocations, and the size of all evicted
                visible VRAM allocations.

                1. No list is exposed. Only sums of buffer sizes are
                exposed. Also, the eviction list has no meaning
                here. All lists are treated equally, and mem_type is
                compared with preferred_domains to determine where
                buffers are and where they should be.

                2. I'm not interested in the overcommit value. I'm
                only interested in knowing the number of bytes of
                evicted VRAM right now. It can be as variable as the
                CPU load, but in practice it shouldn't be because
                PCIe doesn't have the bandwidth to move things quickly.

                3. Yes, that's true.

                Marek



Reply via email to