On 07/11/23 11:00, Lazar, Lijo wrote: > > > On 11/1/2023 1:36 PM, Aravind Iddamsetty wrote: >> >> On 30/10/23 20:41, Lazar, Lijo wrote: >>> >>> >>> On 10/30/2023 11:49 AM, Aravind Iddamsetty wrote: >>>> >>>> On 26/10/23 15:34, Lazar, Lijo wrote: >>>> >>>> Hi Lijo, >>>> >>>> Thank you for your comments. >>>> >>>>> >>>>> >>>>> On 10/23/2023 8:59 PM, Alex Deucher wrote: >>>>>> On Fri, Oct 20, 2023 at 7:42 PM Aravind Iddamsetty >>>>>> <aravind.iddamse...@linux.intel.com> wrote: >>>>>>> >>>>>>> Our hardware supports RAS(Reliability, Availability, Serviceability) by >>>>>>> reporting the errors to the host, which the KMD processes and exposes a >>>>>>> set of error counters which can be used by observability tools to take >>>>>>> corrective actions or repairs. Traditionally there were being exposed >>>>>>> via PMU (for relative counters) and sysfs interface (for absolute >>>>>>> value) in our internal branch. But, due to the limitations in this >>>>>>> approach to use two interfaces and also not able to have an event based >>>>>>> reporting or configurability, an alternative approach to try netlink >>>>>>> was suggested by community for drm subsystem wide UAPI for RAS and >>>>>>> telemetry as discussed in [1]. >>>>>>> >>>>>>> This [1] is the inspiration to this series. It uses the generic >>>>>>> netlink(genl) family subsystem and exposes a set of commands that can >>>>>>> be used by every drm driver, the framework provides a means to have >>>>>>> custom commands too. Each drm driver instance in this example xe driver >>>>>>> instance registers a family and operations to the genl subsystem through >>>>>>> which it enumerates and reports the error counters. An event based >>>>>>> notification is also supported to which userpace can subscribe to and >>>>>>> be notified when any error occurs and read the error counter this avoids >>>>>>> continuous polling on error counter. This can also be extended to >>>>>>> threshold based notification. >>>>> >>>>> The commands used seems very limited. In AMD SOCs, IP blocks, instances >>>>> of IP blocks, block types which support RAS will change across >>>>> generations. >>>>> >>>>> This series has a single command to query the counters supported. Within >>>>> that it seems to assign unique ids for every combination of error type, >>>>> IP block type and then another for each instance. Not sure how good this >>>>> kind of approach is for an end user. The Ids won't necessarily the stay >>>>> the same across multiple generations. Users will generally be interested >>>>> in specific IP blocks. >>>> >>>> Exactly the IDs are UAPI and won't change once defined for a platform and >>>> any new SKU or platform will add on top of existing ones. Userspace can >>>> include the header and use the defines. The query is used to know what all >>>> errors exists on a platform and userspace can process the IDs of IP block >>>> of interest. I believe even if we list block wise a query will be needed >>>> without which userspace wouldn't know which blocks exist on a platform. >>>> >>> >>> What I meant is - assigning an id for every combination of IP block/ >>> instance number/error type is not maintainable across different SOCs. >>> >>> Instead, can we have something like - >>> Query -> returns IP block ids, number of instances, error types >>> supported by each IP block. >>> Read Error -> IP block id | Instance number /Instance ALL | Error type >>> id/Error type ALL. >> >> Hi Lijo, >> >> Would you please elaborate more on what is the issue you fore see with the >> maintainability. But I have a query on the model suggested >> >> This might work well with user input based tools, but don't think it suits >> if we want to periodically read a particular counter. >> >> The inspiration to have ID for each is taken from PMU subsystem where every >> event has an ID and a flat list so no multiple queries and we can read them >> individually or group together >> which can be achieved via READ_MULTI command I proposed earlier. >> > > The problem is mainly with maintaining a static list including all ip_id | > instance | err_type combinations. Instead, preference is for client to query > the capabilities -> instance/error types supported and then use that info > later to fetch error info. > > Capability query could return something like ip block, total instance > available and error types supported. This doesn't require to maintain an ID > list for each combination. > > The instances per SOC could be variable. For ex: it's not required that all > SKUs of your SOC type to have have ss0-ss3 HBMs. For the same SOC type or for > new SOC type, it could be more or less. > > Roughly something like .. > > enum ip_block_id > { > block1, > block2, > block3, > .... > block_all > } > > enum ip_sub_block_id (if required) > { > sub_block1, > sub_block2, > .... > sub_block_all > } > > #define INSTANCE_ALL -1 > > enum ras_error_type > { > correctable, > uncorrectable, > deferred, > fatal, > ... > err_all > } > > Then define something like below while querying error details. > > <31:24> = Block Id > <23:16> subblock id > <15:8> - interested instance > <7:0> - error_type > > Instance number could be 'inst_all' or specific IP instance. Hi Lijo,
Thanks for the explanation, will rework as suggested and re post a new series soon. Thanks, Aravind. > > Thanks, > Lijo > >> Thanks, >> Aravind. >>> >>> Thanks, >>> Lijo >>> >>>>> >>>>> For ex: to get HBM errors, it looks like the current patch series >>>>> supports READALL which dumps the whole set of errors. Or, users have to >>>>> figure out the ids of HBM stack instance (whose capacity can change >>>>> depending on the SOC and within a single family multiple configurations >>>>> can exist) errors and do multiple READ_ONE calls. Both don't look good. >>>>> >>>>> It would be better if the command argument format can be well defined so >>>>> that it can be queried based on IP block type, instance, and error types >>>>> supported (CE/UE/fatal/parity/deferred etc.). >>>> >>>> so to mitigate multiple read limitation, we can introduce a new GENL >>>> command like READ_MULTI which accepts a list of errors ids which userspace >>>> can pass and get all interested error counter as response at once. Also, >>>> listing individual errors helps if userspace wants to read a particular >>>> error at regular intervals. The intention is also to keep KMD logic >>>> simple, userspace can build required model on top of flat enumeration. >>>> >>>> Please let me know if this sounds reasonable to you. >>>> >>>> Thanks, >>>> Aravind. >>>>> >>>>> Thanks, >>>>> Lijo >>>>> >>>>>> >>>>>> @Hawking Zhang, @Lazar, Lijo >>>>>> >>>>>> Can you take a look at this series and API and see if it would align >>>>>> with our RAS requirements going forward? >>>>>> >>>>>> Alex >>>>>> >>>>>> >>>>>>> >>>>>>> [1]: >>>>>>> https://airlied.blogspot.com/2022/09/accelerators-bof-outcomes-summary.html >>>>>>> >>>>>>> this series is on top of >>>>>>> https://patchwork.freedesktop.org/series/125373/, >>>>>>> >>>>>>> v4: >>>>>>> 1. Rebase >>>>>>> 2. rename drm_genl_send to drm_genl_reply >>>>>>> 3. catch error from xa_store and handle appropriately >>>>>>> 4. presently xe_list_errors fills blank data for IGFX, prevent it by >>>>>>> having an early check of IS_DGFX (Michael J. Ruhl) >>>>>>> >>>>>>> v3: >>>>>>> 1. Rebase on latest RAS series for XE >>>>>>> 2. drop DRIVER_NETLINK flag and use the driver_genl_ops structure to >>>>>>> register to netlink subsystem >>>>>>> >>>>>>> v2: define common interfaces to genl netlink subsystem that all drm >>>>>>> drivers >>>>>>> can leverage. >>>>>>> >>>>>>> Below is an example tool drm_ras which demonstrates the use of the >>>>>>> supported commands. The tool will be sent to ML with the subject >>>>>>> "[RFC i-g-t v2 0/1] A tool to demonstrate use of netlink sockets to >>>>>>> read RAS error counters" >>>>>>> https://patchwork.freedesktop.org/series/118437/#rev2 >>>>>>> >>>>>>> read single error counter: >>>>>>> >>>>>>> $ ./drm_ras READ_ONE --device=drm:/dev/dri/card1 >>>>>>> --error_id=0x0000000000000005 >>>>>>> counter value 0 >>>>>>> >>>>>>> read all error counters: >>>>>>> >>>>>>> $ ./drm_ras READ_ALL --device=drm:/dev/dri/card1 >>>>>>> name config-id >>>>>>> counter >>>>>>> >>>>>>> error-gt0-correctable-guc >>>>>>> 0x0000000000000001 0 >>>>>>> error-gt0-correctable-slm >>>>>>> 0x0000000000000003 0 >>>>>>> error-gt0-correctable-eu-ic >>>>>>> 0x0000000000000004 0 >>>>>>> error-gt0-correctable-eu-grf >>>>>>> 0x0000000000000005 0 >>>>>>> error-gt0-fatal-guc >>>>>>> 0x0000000000000009 0 >>>>>>> error-gt0-fatal-slm >>>>>>> 0x000000000000000d 0 >>>>>>> error-gt0-fatal-eu-grf >>>>>>> 0x000000000000000f 0 >>>>>>> error-gt0-fatal-fpu >>>>>>> 0x0000000000000010 0 >>>>>>> error-gt0-fatal-tlb >>>>>>> 0x0000000000000011 0 >>>>>>> error-gt0-fatal-l3-fabric >>>>>>> 0x0000000000000012 0 >>>>>>> error-gt0-correctable-subslice >>>>>>> 0x0000000000000013 0 >>>>>>> error-gt0-correctable-l3bank >>>>>>> 0x0000000000000014 0 >>>>>>> error-gt0-fatal-subslice >>>>>>> 0x0000000000000015 0 >>>>>>> error-gt0-fatal-l3bank >>>>>>> 0x0000000000000016 0 >>>>>>> error-gt0-sgunit-correctable >>>>>>> 0x0000000000000017 0 >>>>>>> error-gt0-sgunit-nonfatal >>>>>>> 0x0000000000000018 0 >>>>>>> error-gt0-sgunit-fatal >>>>>>> 0x0000000000000019 0 >>>>>>> error-gt0-soc-fatal-psf-csc-0 >>>>>>> 0x000000000000001a 0 >>>>>>> error-gt0-soc-fatal-psf-csc-1 >>>>>>> 0x000000000000001b 0 >>>>>>> error-gt0-soc-fatal-psf-csc-2 >>>>>>> 0x000000000000001c 0 >>>>>>> error-gt0-soc-fatal-punit >>>>>>> 0x000000000000001d 0 >>>>>>> error-gt0-soc-fatal-psf-0 >>>>>>> 0x000000000000001e 0 >>>>>>> error-gt0-soc-fatal-psf-1 >>>>>>> 0x000000000000001f 0 >>>>>>> error-gt0-soc-fatal-psf-2 >>>>>>> 0x0000000000000020 0 >>>>>>> error-gt0-soc-fatal-cd0 >>>>>>> 0x0000000000000021 0 >>>>>>> error-gt0-soc-fatal-cd0-mdfi >>>>>>> 0x0000000000000022 0 >>>>>>> error-gt0-soc-fatal-mdfi-east >>>>>>> 0x0000000000000023 0 >>>>>>> error-gt0-soc-fatal-mdfi-south >>>>>>> 0x0000000000000024 0 >>>>>>> error-gt0-soc-fatal-hbm-ss0-0 >>>>>>> 0x0000000000000025 0 >>>>>>> error-gt0-soc-fatal-hbm-ss0-1 >>>>>>> 0x0000000000000026 0 >>>>>>> error-gt0-soc-fatal-hbm-ss0-2 >>>>>>> 0x0000000000000027 0 >>>>>>> error-gt0-soc-fatal-hbm-ss0-3 >>>>>>> 0x0000000000000028 0 >>>>>>> error-gt0-soc-fatal-hbm-ss0-4 >>>>>>> 0x0000000000000029 0 >>>>>>> error-gt0-soc-fatal-hbm-ss0-5 >>>>>>> 0x000000000000002a 0 >>>>>>> error-gt0-soc-fatal-hbm-ss0-6 >>>>>>> 0x000000000000002b 0 >>>>>>> error-gt0-soc-fatal-hbm-ss0-7 >>>>>>> 0x000000000000002c 0 >>>>>>> error-gt0-soc-fatal-hbm-ss1-0 >>>>>>> 0x000000000000002d 0 >>>>>>> error-gt0-soc-fatal-hbm-ss1-1 >>>>>>> 0x000000000000002e 0 >>>>>>> error-gt0-soc-fatal-hbm-ss1-2 >>>>>>> 0x000000000000002f 0 >>>>>>> error-gt0-soc-fatal-hbm-ss1-3 >>>>>>> 0x0000000000000030 0 >>>>>>> error-gt0-soc-fatal-hbm-ss1-4 >>>>>>> 0x0000000000000031 0 >>>>>>> error-gt0-soc-fatal-hbm-ss1-5 >>>>>>> 0x0000000000000032 0 >>>>>>> error-gt0-soc-fatal-hbm-ss1-6 >>>>>>> 0x0000000000000033 0 >>>>>>> error-gt0-soc-fatal-hbm-ss1-7 >>>>>>> 0x0000000000000034 0 >>>>>>> error-gt0-soc-fatal-hbm-ss2-0 >>>>>>> 0x0000000000000035 0 >>>>>>> error-gt0-soc-fatal-hbm-ss2-1 >>>>>>> 0x0000000000000036 0 >>>>>>> error-gt0-soc-fatal-hbm-ss2-2 >>>>>>> 0x0000000000000037 0 >>>>>>> error-gt0-soc-fatal-hbm-ss2-3 >>>>>>> 0x0000000000000038 0 >>>>>>> error-gt0-soc-fatal-hbm-ss2-4 >>>>>>> 0x0000000000000039 0 >>>>>>> error-gt0-soc-fatal-hbm-ss2-5 >>>>>>> 0x000000000000003a 0 >>>>>>> error-gt0-soc-fatal-hbm-ss2-6 >>>>>>> 0x000000000000003b 0 >>>>>>> error-gt0-soc-fatal-hbm-ss2-7 >>>>>>> 0x000000000000003c 0 >>>>>>> error-gt0-soc-fatal-hbm-ss3-0 >>>>>>> 0x000000000000003d 0 >>>>>>> error-gt0-soc-fatal-hbm-ss3-1 >>>>>>> 0x000000000000003e 0 >>>>>>> error-gt0-soc-fatal-hbm-ss3-2 >>>>>>> 0x000000000000003f 0 >>>>>>> error-gt0-soc-fatal-hbm-ss3-3 >>>>>>> 0x0000000000000040 0 >>>>>>> error-gt0-soc-fatal-hbm-ss3-4 >>>>>>> 0x0000000000000041 0 >>>>>>> error-gt0-soc-fatal-hbm-ss3-5 >>>>>>> 0x0000000000000042 0 >>>>>>> error-gt0-soc-fatal-hbm-ss3-6 >>>>>>> 0x0000000000000043 0 >>>>>>> error-gt0-soc-fatal-hbm-ss3-7 >>>>>>> 0x0000000000000044 0 >>>>>>> error-gt0-gsc-correctable-sram-ecc >>>>>>> 0x0000000000000045 0 >>>>>>> error-gt0-gsc-nonfatal-mia-shutdown >>>>>>> 0x0000000000000046 0 >>>>>>> error-gt0-gsc-nonfatal-mia-int >>>>>>> 0x0000000000000047 0 >>>>>>> error-gt0-gsc-nonfatal-sram-ecc >>>>>>> 0x0000000000000048 0 >>>>>>> error-gt0-gsc-nonfatal-wdg-timeout >>>>>>> 0x0000000000000049 0 >>>>>>> error-gt0-gsc-nonfatal-rom-parity >>>>>>> 0x000000000000004a 0 >>>>>>> error-gt0-gsc-nonfatal-ucode-parity >>>>>>> 0x000000000000004b 0 >>>>>>> error-gt0-gsc-nonfatal-glitch-det >>>>>>> 0x000000000000004c 0 >>>>>>> error-gt0-gsc-nonfatal-fuse-pull >>>>>>> 0x000000000000004d 0 >>>>>>> error-gt0-gsc-nonfatal-fuse-crc-check >>>>>>> 0x000000000000004e 0 >>>>>>> error-gt0-gsc-nonfatal-selfmbist >>>>>>> 0x000000000000004f 0 >>>>>>> error-gt0-gsc-nonfatal-aon-parity >>>>>>> 0x0000000000000050 0 >>>>>>> error-gt1-correctable-guc >>>>>>> 0x1000000000000001 0 >>>>>>> error-gt1-correctable-slm >>>>>>> 0x1000000000000003 0 >>>>>>> error-gt1-correctable-eu-ic >>>>>>> 0x1000000000000004 0 >>>>>>> error-gt1-correctable-eu-grf >>>>>>> 0x1000000000000005 0 >>>>>>> error-gt1-fatal-guc >>>>>>> 0x1000000000000009 0 >>>>>>> error-gt1-fatal-slm >>>>>>> 0x100000000000000d 0 >>>>>>> error-gt1-fatal-eu-grf >>>>>>> 0x100000000000000f 0 >>>>>>> error-gt1-fatal-fpu >>>>>>> 0x1000000000000010 0 >>>>>>> error-gt1-fatal-tlb >>>>>>> 0x1000000000000011 0 >>>>>>> error-gt1-fatal-l3-fabric >>>>>>> 0x1000000000000012 0 >>>>>>> error-gt1-correctable-subslice >>>>>>> 0x1000000000000013 0 >>>>>>> error-gt1-correctable-l3bank >>>>>>> 0x1000000000000014 0 >>>>>>> error-gt1-fatal-subslice >>>>>>> 0x1000000000000015 0 >>>>>>> error-gt1-fatal-l3bank >>>>>>> 0x1000000000000016 0 >>>>>>> error-gt1-sgunit-correctable >>>>>>> 0x1000000000000017 0 >>>>>>> error-gt1-sgunit-nonfatal >>>>>>> 0x1000000000000018 0 >>>>>>> error-gt1-sgunit-fatal >>>>>>> 0x1000000000000019 0 >>>>>>> error-gt1-soc-fatal-psf-csc-0 >>>>>>> 0x100000000000001a 0 >>>>>>> error-gt1-soc-fatal-psf-csc-1 >>>>>>> 0x100000000000001b 0 >>>>>>> error-gt1-soc-fatal-psf-csc-2 >>>>>>> 0x100000000000001c 0 >>>>>>> error-gt1-soc-fatal-punit >>>>>>> 0x100000000000001d 0 >>>>>>> error-gt1-soc-fatal-psf-0 >>>>>>> 0x100000000000001e 0 >>>>>>> error-gt1-soc-fatal-psf-1 >>>>>>> 0x100000000000001f 0 >>>>>>> error-gt1-soc-fatal-psf-2 >>>>>>> 0x1000000000000020 0 >>>>>>> error-gt1-soc-fatal-cd0 >>>>>>> 0x1000000000000021 0 >>>>>>> error-gt1-soc-fatal-cd0-mdfi >>>>>>> 0x1000000000000022 0 >>>>>>> error-gt1-soc-fatal-mdfi-east >>>>>>> 0x1000000000000023 0 >>>>>>> error-gt1-soc-fatal-mdfi-south >>>>>>> 0x1000000000000024 0 >>>>>>> error-gt1-soc-fatal-hbm-ss0-0 >>>>>>> 0x1000000000000025 0 >>>>>>> error-gt1-soc-fatal-hbm-ss0-1 >>>>>>> 0x1000000000000026 0 >>>>>>> error-gt1-soc-fatal-hbm-ss0-2 >>>>>>> 0x1000000000000027 0 >>>>>>> error-gt1-soc-fatal-hbm-ss0-3 >>>>>>> 0x1000000000000028 0 >>>>>>> error-gt1-soc-fatal-hbm-ss0-4 >>>>>>> 0x1000000000000029 0 >>>>>>> error-gt1-soc-fatal-hbm-ss0-5 >>>>>>> 0x100000000000002a 0 >>>>>>> error-gt1-soc-fatal-hbm-ss0-6 >>>>>>> 0x100000000000002b 0 >>>>>>> error-gt1-soc-fatal-hbm-ss0-7 >>>>>>> 0x100000000000002c 0 >>>>>>> error-gt1-soc-fatal-hbm-ss1-0 >>>>>>> 0x100000000000002d 0 >>>>>>> error-gt1-soc-fatal-hbm-ss1-1 >>>>>>> 0x100000000000002e 0 >>>>>>> error-gt1-soc-fatal-hbm-ss1-2 >>>>>>> 0x100000000000002f 0 >>>>>>> error-gt1-soc-fatal-hbm-ss1-3 >>>>>>> 0x1000000000000030 0 >>>>>>> error-gt1-soc-fatal-hbm-ss1-4 >>>>>>> 0x1000000000000031 0 >>>>>>> error-gt1-soc-fatal-hbm-ss1-5 >>>>>>> 0x1000000000000032 0 >>>>>>> error-gt1-soc-fatal-hbm-ss1-6 >>>>>>> 0x1000000000000033 0 >>>>>>> error-gt1-soc-fatal-hbm-ss1-7 >>>>>>> 0x1000000000000034 0 >>>>>>> error-gt1-soc-fatal-hbm-ss2-0 >>>>>>> 0x1000000000000035 0 >>>>>>> error-gt1-soc-fatal-hbm-ss2-1 >>>>>>> 0x1000000000000036 0 >>>>>>> error-gt1-soc-fatal-hbm-ss2-2 >>>>>>> 0x1000000000000037 0 >>>>>>> error-gt1-soc-fatal-hbm-ss2-3 >>>>>>> 0x1000000000000038 0 >>>>>>> error-gt1-soc-fatal-hbm-ss2-4 >>>>>>> 0x1000000000000039 0 >>>>>>> error-gt1-soc-fatal-hbm-ss2-5 >>>>>>> 0x100000000000003a 0 >>>>>>> error-gt1-soc-fatal-hbm-ss2-6 >>>>>>> 0x100000000000003b 0 >>>>>>> error-gt1-soc-fatal-hbm-ss2-7 >>>>>>> 0x100000000000003c 0 >>>>>>> error-gt1-soc-fatal-hbm-ss3-0 >>>>>>> 0x100000000000003d 0 >>>>>>> error-gt1-soc-fatal-hbm-ss3-1 >>>>>>> 0x100000000000003e 0 >>>>>>> error-gt1-soc-fatal-hbm-ss3-2 >>>>>>> 0x100000000000003f 0 >>>>>>> error-gt1-soc-fatal-hbm-ss3-3 >>>>>>> 0x1000000000000040 0 >>>>>>> error-gt1-soc-fatal-hbm-ss3-4 >>>>>>> 0x1000000000000041 0 >>>>>>> error-gt1-soc-fatal-hbm-ss3-5 >>>>>>> 0x1000000000000042 0 >>>>>>> error-gt1-soc-fatal-hbm-ss3-6 >>>>>>> 0x1000000000000043 0 >>>>>>> error-gt1-soc-fatal-hbm-ss3-7 >>>>>>> 0x1000000000000044 0 >>>>>>> >>>>>>> wait on a error event: >>>>>>> >>>>>>> $ ./drm_ras WAIT_ON_EVENT --device=drm:/dev/dri/card1 >>>>>>> waiting for error event >>>>>>> error event received >>>>>>> counter value 0 >>>>>>> >>>>>>> list all errors: >>>>>>> >>>>>>> $ ./drm_ras LIST_ERRORS --device=drm:/dev/dri/card1 >>>>>>> name config-id >>>>>>> >>>>>>> error-gt0-correctable-guc >>>>>>> 0x0000000000000001 >>>>>>> error-gt0-correctable-slm >>>>>>> 0x0000000000000003 >>>>>>> error-gt0-correctable-eu-ic >>>>>>> 0x0000000000000004 >>>>>>> error-gt0-correctable-eu-grf >>>>>>> 0x0000000000000005 >>>>>>> error-gt0-fatal-guc >>>>>>> 0x0000000000000009 >>>>>>> error-gt0-fatal-slm >>>>>>> 0x000000000000000d >>>>>>> error-gt0-fatal-eu-grf >>>>>>> 0x000000000000000f >>>>>>> error-gt0-fatal-fpu >>>>>>> 0x0000000000000010 >>>>>>> error-gt0-fatal-tlb >>>>>>> 0x0000000000000011 >>>>>>> error-gt0-fatal-l3-fabric >>>>>>> 0x0000000000000012 >>>>>>> error-gt0-correctable-subslice >>>>>>> 0x0000000000000013 >>>>>>> error-gt0-correctable-l3bank >>>>>>> 0x0000000000000014 >>>>>>> error-gt0-fatal-subslice >>>>>>> 0x0000000000000015 >>>>>>> error-gt0-fatal-l3bank >>>>>>> 0x0000000000000016 >>>>>>> error-gt0-sgunit-correctable >>>>>>> 0x0000000000000017 >>>>>>> error-gt0-sgunit-nonfatal >>>>>>> 0x0000000000000018 >>>>>>> error-gt0-sgunit-fatal >>>>>>> 0x0000000000000019 >>>>>>> error-gt0-soc-fatal-psf-csc-0 >>>>>>> 0x000000000000001a >>>>>>> error-gt0-soc-fatal-psf-csc-1 >>>>>>> 0x000000000000001b >>>>>>> error-gt0-soc-fatal-psf-csc-2 >>>>>>> 0x000000000000001c >>>>>>> error-gt0-soc-fatal-punit >>>>>>> 0x000000000000001d >>>>>>> error-gt0-soc-fatal-psf-0 >>>>>>> 0x000000000000001e >>>>>>> error-gt0-soc-fatal-psf-1 >>>>>>> 0x000000000000001f >>>>>>> error-gt0-soc-fatal-psf-2 >>>>>>> 0x0000000000000020 >>>>>>> error-gt0-soc-fatal-cd0 >>>>>>> 0x0000000000000021 >>>>>>> error-gt0-soc-fatal-cd0-mdfi >>>>>>> 0x0000000000000022 >>>>>>> error-gt0-soc-fatal-mdfi-east >>>>>>> 0x0000000000000023 >>>>>>> error-gt0-soc-fatal-mdfi-south >>>>>>> 0x0000000000000024 >>>>>>> error-gt0-soc-fatal-hbm-ss0-0 >>>>>>> 0x0000000000000025 >>>>>>> error-gt0-soc-fatal-hbm-ss0-1 >>>>>>> 0x0000000000000026 >>>>>>> error-gt0-soc-fatal-hbm-ss0-2 >>>>>>> 0x0000000000000027 >>>>>>> error-gt0-soc-fatal-hbm-ss0-3 >>>>>>> 0x0000000000000028 >>>>>>> error-gt0-soc-fatal-hbm-ss0-4 >>>>>>> 0x0000000000000029 >>>>>>> error-gt0-soc-fatal-hbm-ss0-5 >>>>>>> 0x000000000000002a >>>>>>> error-gt0-soc-fatal-hbm-ss0-6 >>>>>>> 0x000000000000002b >>>>>>> error-gt0-soc-fatal-hbm-ss0-7 >>>>>>> 0x000000000000002c >>>>>>> error-gt0-soc-fatal-hbm-ss1-0 >>>>>>> 0x000000000000002d >>>>>>> error-gt0-soc-fatal-hbm-ss1-1 >>>>>>> 0x000000000000002e >>>>>>> error-gt0-soc-fatal-hbm-ss1-2 >>>>>>> 0x000000000000002f >>>>>>> error-gt0-soc-fatal-hbm-ss1-3 >>>>>>> 0x0000000000000030 >>>>>>> error-gt0-soc-fatal-hbm-ss1-4 >>>>>>> 0x0000000000000031 >>>>>>> error-gt0-soc-fatal-hbm-ss1-5 >>>>>>> 0x0000000000000032 >>>>>>> error-gt0-soc-fatal-hbm-ss1-6 >>>>>>> 0x0000000000000033 >>>>>>> error-gt0-soc-fatal-hbm-ss1-7 >>>>>>> 0x0000000000000034 >>>>>>> error-gt0-soc-fatal-hbm-ss2-0 >>>>>>> 0x0000000000000035 >>>>>>> error-gt0-soc-fatal-hbm-ss2-1 >>>>>>> 0x0000000000000036 >>>>>>> error-gt0-soc-fatal-hbm-ss2-2 >>>>>>> 0x0000000000000037 >>>>>>> error-gt0-soc-fatal-hbm-ss2-3 >>>>>>> 0x0000000000000038 >>>>>>> error-gt0-soc-fatal-hbm-ss2-4 >>>>>>> 0x0000000000000039 >>>>>>> error-gt0-soc-fatal-hbm-ss2-5 >>>>>>> 0x000000000000003a >>>>>>> error-gt0-soc-fatal-hbm-ss2-6 >>>>>>> 0x000000000000003b >>>>>>> error-gt0-soc-fatal-hbm-ss2-7 >>>>>>> 0x000000000000003c >>>>>>> error-gt0-soc-fatal-hbm-ss3-0 >>>>>>> 0x000000000000003d >>>>>>> error-gt0-soc-fatal-hbm-ss3-1 >>>>>>> 0x000000000000003e >>>>>>> error-gt0-soc-fatal-hbm-ss3-2 >>>>>>> 0x000000000000003f >>>>>>> error-gt0-soc-fatal-hbm-ss3-3 >>>>>>> 0x0000000000000040 >>>>>>> error-gt0-soc-fatal-hbm-ss3-4 >>>>>>> 0x0000000000000041 >>>>>>> error-gt0-soc-fatal-hbm-ss3-5 >>>>>>> 0x0000000000000042 >>>>>>> error-gt0-soc-fatal-hbm-ss3-6 >>>>>>> 0x0000000000000043 >>>>>>> error-gt0-soc-fatal-hbm-ss3-7 >>>>>>> 0x0000000000000044 >>>>>>> error-gt0-gsc-correctable-sram-ecc >>>>>>> 0x0000000000000045 >>>>>>> error-gt0-gsc-nonfatal-mia-shutdown >>>>>>> 0x0000000000000046 >>>>>>> error-gt0-gsc-nonfatal-mia-int >>>>>>> 0x0000000000000047 >>>>>>> error-gt0-gsc-nonfatal-sram-ecc >>>>>>> 0x0000000000000048 >>>>>>> error-gt0-gsc-nonfatal-wdg-timeout >>>>>>> 0x0000000000000049 >>>>>>> error-gt0-gsc-nonfatal-rom-parity >>>>>>> 0x000000000000004a >>>>>>> error-gt0-gsc-nonfatal-ucode-parity >>>>>>> 0x000000000000004b >>>>>>> error-gt0-gsc-nonfatal-glitch-det >>>>>>> 0x000000000000004c >>>>>>> error-gt0-gsc-nonfatal-fuse-pull >>>>>>> 0x000000000000004d >>>>>>> error-gt0-gsc-nonfatal-fuse-crc-check >>>>>>> 0x000000000000004e >>>>>>> error-gt0-gsc-nonfatal-selfmbist >>>>>>> 0x000000000000004f >>>>>>> error-gt0-gsc-nonfatal-aon-parity >>>>>>> 0x0000000000000050 >>>>>>> error-gt1-correctable-guc >>>>>>> 0x1000000000000001 >>>>>>> error-gt1-correctable-slm >>>>>>> 0x1000000000000003 >>>>>>> error-gt1-correctable-eu-ic >>>>>>> 0x1000000000000004 >>>>>>> error-gt1-correctable-eu-grf�� >>>>>>> 0x1000000000000005 >>>>>>> error-gt1-fatal-guc >>>>>>> 0x1000000000000009 >>>>>>> error-gt1-fatal-slm >>>>>>> 0x100000000000000d >>>>>>> error-gt1-fatal-eu-grf >>>>>>> 0x100000000000000f >>>>>>> error-gt1-fatal-fpu >>>>>>> 0x1000000000000010 >>>>>>> error-gt1-fatal-tlb >>>>>>> 0x1000000000000011 >>>>>>> error-gt1-fatal-l3-fabric >>>>>>> 0x1000000000000012 >>>>>>> error-gt1-correctable-subslice >>>>>>> 0x1000000000000013 >>>>>>> error-gt1-correctable-l3bank >>>>>>> 0x1000000000000014 >>>>>>> error-gt1-fatal-subslice >>>>>>> 0x1000000000000015 >>>>>>> error-gt1-fatal-l3bank >>>>>>> 0x1000000000000016 >>>>>>> error-gt1-sgunit-correctable >>>>>>> 0x1000000000000017 >>>>>>> error-gt1-sgunit-nonfatal >>>>>>> 0x1000000000000018 >>>>>>> error-gt1-sgunit-fatal >>>>>>> 0x1000000000000019 >>>>>>> error-gt1-soc-fatal-psf-csc-0 >>>>>>> 0x100000000000001a >>>>>>> error-gt1-soc-fatal-psf-csc-1 >>>>>>> 0x100000000000001b >>>>>>> error-gt1-soc-fatal-psf-csc-2 >>>>>>> 0x100000000000001c >>>>>>> error-gt1-soc-fatal-punit >>>>>>> 0x100000000000001d >>>>>>> error-gt1-soc-fatal-psf-0 >>>>>>> 0x100000000000001e >>>>>>> error-gt1-soc-fatal-psf-1 >>>>>>> 0x100000000000001f >>>>>>> error-gt1-soc-fatal-psf-2 >>>>>>> 0x1000000000000020 >>>>>>> error-gt1-soc-fatal-cd0 >>>>>>> 0x1000000000000021 >>>>>>> error-gt1-soc-fatal-cd0-mdfi �� >>>>>>> 0x1000000000000022 >>>>>>> error-gt1-soc-fatal-mdfi-east >>>>>>> 0x1000000000000023 >>>>>>> error-gt1-soc-fatal-mdfi-south >>>>>>> 0x1000000000000024 >>>>>>> error-gt1-soc-fatal-hbm-ss0-0 >>>>>>> 0x1000000000000025 >>>>>>> error-gt1-soc-fatal-hbm-ss0-1 >>>>>>> 0x1000000000000026 >>>>>>> error-gt1-soc-fatal-hbm-ss0-2 >>>>>>> 0x1000000000000027 >>>>>>> error-gt1-soc-fatal-hbm-ss0-3 >>>>>>> 0x1000000000000028 >>>>>>> error-gt1-soc-fatal-hbm-ss0-4 >>>>>>> 0x1000000000000029 >>>>>>> error-gt1-soc-fatal-hbm-ss0-5 >>>>>>> 0x100000000000002a >>>>>>> error-gt1-soc-fatal-hbm-ss0-6 >>>>>>> 0x100000000000002b >>>>>>> error-gt1-soc-fatal-hbm-ss0-7 >>>>>>> 0x100000000000002c >>>>>>> error-gt1-soc-fatal-hbm-ss1-0 >>>>>>> 0x100000000000002d >>>>>>> error-gt1-soc-fatal-hbm-ss1-1 >>>>>>> 0x100000000000002e >>>>>>> error-gt1-soc-fatal-hbm-ss1-2 >>>>>>> 0x100000000000002f >>>>>>> error-gt1-soc-fatal-hbm-ss1-3 >>>>>>> 0x1000000000000030 >>>>>>> error-gt1-soc-fatal-hbm-ss1-4 >>>>>>> 0x1000000000000031 >>>>>>> error-gt1-soc-fatal-hbm-ss1-5 >>>>>>> 0x1000000000000032 >>>>>>> error-gt1-soc-fatal-hbm-ss1-6 >>>>>>> 0x1000000000000033 >>>>>>> error-gt1-soc-fatal-hbm-ss1-7 >>>>>>> 0x1000000000000034 >>>>>>> error-gt1-soc-fatal-hbm-ss2-0 >>>>>>> 0x1000000000000035 >>>>>>> error-gt1-soc-fatal-hbm-ss2-1 >>>>>>> 0x1000000000000036 >>>>>>> error-gt1-soc-fatal-hbm-ss2-2 >>>>>>> 0x1000000000000037 >>>>>>> error-gt1-soc-fatal-hbm-ss2-3 >>>>>>> 0x1000000000000038 >>>>>>> error-gt1-soc-fatal-hbm-ss2-4 >>>>>>> 0x1000000000000039 >>>>>>> error-gt1-soc-fatal-hbm-ss2-5 >>>>>>> 0x100000000000003a >>>>>>> error-gt1-soc-fatal-hbm-ss2-6 >>>>>>> 0x100000000000003b >>>>>>> error-gt1-soc-fatal-hbm-ss2-7 >>>>>>> 0x100000000000003c >>>>>>> error-gt1-soc-fatal-hbm-ss3-0 >>>>>>> 0x100000000000003d >>>>>>> error-gt1-soc-fatal-hbm-ss3-1 >>>>>>> 0x100000000000003e >>>>>>> error-gt1-soc-fatal-hbm-ss3-2 >>>>>>> 0x100000000000003f >>>>>>> error-gt1-soc-fatal-hbm-ss3-3 >>>>>>> 0x1000000000000040 >>>>>>> error-gt1-soc-fatal-hbm-ss3-4 >>>>>>> 0x1000000000000041 >>>>>>> error-gt1-soc-fatal-hbm-ss3-5 >>>>>>> 0x1000000000000042 >>>>>>> error-gt1-soc-fatal-hbm-ss3-6 >>>>>>> 0x1000000000000043 >>>>>>> error-gt1-soc-fatal-hbm-ss3-7 >>>>>>> 0x1000000000000044 >>>>>>> >>>>>>> Cc: Alex Deucher <alexander.deuc...@amd.com> >>>>>>> Cc: David Airlie <airl...@gmail.com> >>>>>>> Cc: Daniel Vetter <dan...@ffwll.ch> >>>>>>> Cc: Joonas Lahtinen <joonas.lahti...@linux.intel.com> >>>>>>> Cc: Oded Gabbay <ogab...@kernel.org> >>>>>>> Cc: Tomer Tayar <tta...@habana.ai> >>>>>>> Cc: Hawking Zhang <hawking.zh...@amd.com> >>>>>>> Cc: Harish Kasiviswanathan <harish.kasiviswanat...@amd.com> >>>>>>> Cc: Kuehling Felix <felix.kuehl...@amd.com> >>>>>>> Cc: Tuikov Luben <luben.tui...@amd.com> >>>>>>> Cc: Ruhl, Michael J <michael.j.r...@intel.com> >>>>>>> >>>>>>> >>>>>>> Aravind Iddamsetty (5): >>>>>>> drm/netlink: Add netlink infrastructure >>>>>>> drm/xe/RAS: Register netlink capability >>>>>>> drm/xe/RAS: Expose the error counters >>>>>>> drm/netlink: Define multicast groups >>>>>>> drm/xe/RAS: send multicast event on occurrence of an error >>>>>>> >>>>>>> drivers/gpu/drm/Makefile | 1 + >>>>>>> drivers/gpu/drm/drm_drv.c | 7 + >>>>>>> drivers/gpu/drm/drm_netlink.c | 195 ++++++++++ >>>>>>> drivers/gpu/drm/xe/Makefile | 1 + >>>>>>> drivers/gpu/drm/xe/xe_device.c | 4 + >>>>>>> drivers/gpu/drm/xe/xe_device_types.h | 1 + >>>>>>> drivers/gpu/drm/xe/xe_hw_error.c | 33 ++ >>>>>>> drivers/gpu/drm/xe/xe_netlink.c | 517 >>>>>>> +++++++++++++++++++++++++++ >>>>>>> include/drm/drm_device.h | 8 + >>>>>>> include/drm/drm_drv.h | 7 + >>>>>>> include/drm/drm_netlink.h | 35 ++ >>>>>>> include/uapi/drm/drm_netlink.h | 87 +++++ >>>>>>> include/uapi/drm/xe_drm.h | 81 +++++ >>>>>>> 13 files changed, 977 insertions(+) >>>>>>> create mode 100644 drivers/gpu/drm/drm_netlink.c >>>>>>> create mode 100644 drivers/gpu/drm/xe/xe_netlink.c >>>>>>> create mode 100644 include/drm/drm_netlink.h >>>>>>> create mode 100644 include/uapi/drm/drm_netlink.h >>>>>>> >>>>>>> -- >>>>>>> 2.25.1 >>>>>>>