How are the counters which the HUD consumes declared?

See what I want to avoid is a) to nail down the interface with the kernel on specific values and b) make it possible to easily expose new values.

In other words what we could do with fdinfo is to have something like this:

GALLIUM_FDINFO_HUD=drm-memory-vram,amd-evicted-vram,amd-mclk glxgears

And the HUD just displays the values the kernel provides without the need to re-compile mesa when we want to add some more values nor have the values as part of the UAPI.

Christian.

Am 24.01.23 um 08:37 schrieb Marek Olšák:
The Gallium HUD doesn't consume strings. It only consumes values that are exposed as counters from the driver. In this case, we need the driver to expose evicted stats as counters. Each counter can set whether the value is absolute (e.g. memory usage) or monotonic (e.g. perf counter). Parsing fdinfo to get the values is undesirable.

Marek

On Mon, Jan 23, 2023 at 4:31 AM Christian König <ckoenig.leichtzumer...@gmail.com> wrote:

    Let's do this as valid in fdinfo.

    This way we can easily extend whatever the kernel wants to display
    as statistics in the userspace HUD.

    Regards,
    Christian.

    Am 21.01.23 um 01:45 schrieb Marek Olšák:
    We badly need a way to query evicted memory usage. It's essential
    for investigating performance problems and it uncovered the buddy
    allocator disaster. Please either suggest an alternative, suggest
    changes, or review. We need it ASAP.

    Thanks,
    Marek

    On Tue, Jan 10, 2023 at 11:55 AM Marek Olšák <mar...@gmail.com>
    wrote:

        On Tue, Jan 10, 2023 at 11:23 AM Christian König
        <ckoenig.leichtzumer...@gmail.com> wrote:

            Am 10.01.23 um 16:28 schrieb Marek Olšák:
            On Wed, Jan 4, 2023 at 9:51 AM Christian König
            <ckoenig.leichtzumer...@gmail.com> wrote:

                Am 04.01.23 um 00:08 schrieb Marek Olšák:
                I see about the access now, but did you even look
                at the patch?

                I did look at the patch, but I haven't fully
                understood yet what you are trying to do here.


            First and foremost, it returns the evicted size of VRAM
            and visible VRAM, and returns visible VRAM usage. It
            should be obvious which stat includes the size of another.


                Because what the patch does isn't even exposed to
                common drm code, such as the preferred domain and
                visible VRAM placement, so it can't be in fdinfo
                right now.

                Or do you even know what fdinfo contains? Because
                it contains nothing useful. It only has VRAM and
                GTT usage, which we already have in the INFO ioctl,
                so it has nothing that we need. We mainly need the
                eviction information and visible VRAM information
                now. Everything else is a bonus.

                Well the main question is what are you trying to get
                from that information? The eviction list for example
                is completely meaningless to userspace, that stuff
                is only temporary and will be cleared on the next CS
                again.


            I don't know what you mean. The returned eviction stats
            look correct and are stable (they don't change much).
            You can suggest changes if you think some numbers are
            not reported correctly.


                What we could expose is the VRAM over-commit value,
                e.g. how much BOs which where supposed to be in VRAM
                are in GTT now. I think that's what you are looking
                for here, right?


            The VRAM overcommit value is "evicted_vram".


                Also, it's undesirable to open and parse a text
                file if we can just call an ioctl.

                Well I see the reasoning for that, but I also see
                why other drivers do a lot of the stuff we have as
                IOCTL as separate files in sysfs, fdinfo or debugfs.

                Especially repeating all the static information
                which were already available under sysfs in the INFO
                IOCTL was a design mistake as far as I can see. Just
                compare what AMDGPU and the KFD code is doing to
                what for example i915 is doing.

                Same for things like debug information about a
                process. The fdinfo stuff can be queried from
                external tools (gdb, gputop, umr etc...) as well
                which makes that interface more preferred.


            Nothing uses fdinfo in Mesa. No driver uses sysfs in
            Mesa except drm shims, noop drivers, and Intel for perf
            metrics. sysfs itself is an unusable mess for the PCIe
            query and is missing information.

            I'm not against exposing more stuff through sysfs and
            fdinfo for tools, but I don't see any reason why drivers
            should use it (other than for slowing down queries and
            initialization).

            That's what I'm asking: Is this for some tool or to make
            some driver decision based on it?

            If you just want the numbers for over displaying then I
            think it would be better to put this into fdinfo together
            with the other existing stuff there.


            If you want to make allocation decisions based on this
            then we should have that as IOCTL or even better as
            mmap() page between kernel and userspace. But in this
            case I would also calculation the numbers completely
            different as well.

            See we have at least the following things in the kernel:
            1. The eviction list in the VM.
                Those are the BOs which are currently evicted and
            tried to moved back in on the next CS.

            2. The VRAM over commit value.
                In other words how much more VRAM than available has
            the application tried to allocate?

            3. The visible VRAM usage by this application.

            The end goal is that the eviction list will go away, e.g.
            we will always have stable allocations based on
            allocations of other applications and not constantly swap
            things in and out.

            When you now expose the eviction list to userspace we
            will be stuck with this interface forever.


        It's for the GALLIUM HUD.

        The only missing thing is the size of all evicted VRAM
        allocations, and the size of all evicted visible VRAM
        allocations.

        1. No list is exposed. Only sums of buffer sizes are exposed.
        Also, the eviction list has no meaning here. All lists are
        treated equally, and mem_type is compared with
        preferred_domains to determine where buffers are and where
        they should be.

        2. I'm not interested in the overcommit value. I'm only
        interested in knowing the number of bytes of evicted VRAM
        right now. It can be as variable as the CPU load, but in
        practice it shouldn't be because PCIe doesn't have the
        bandwidth to move things quickly.

        3. Yes, that's true.

        Marek


Reply via email to