On 3/11/20 11:00 AM, Ravi Bangoria wrote:
> Hi Kim,

Hi Ravi,

> On 3/6/20 3:36 AM, Kim Phillips wrote:
>>> On 3/3/20 3:55 AM, Kim Phillips wrote:
>>>> On 3/2/20 2:21 PM, Stephane Eranian wrote:
>>>>> On Mon, Mar 2, 2020 at 2:13 AM Peter Zijlstra <pet...@infradead.org>
>>>>> wrote:
>>>>>>
>>>>>> On Mon, Mar 02, 2020 at 10:53:44AM +0530, Ravi Bangoria wrote:
>>>>>>> Modern processors export such hazard data in Performance
>>>>>>> Monitoring Unit (PMU) registers. Ex, 'Sampled Instruction Event
>>>>>>> Register' on IBM PowerPC[1][2] and 'Instruction-Based Sampling' on
>>>>>>> AMD[3] provides similar information.
>>>>>>>
>>>>>>> Implementation detail:
>>>>>>>
>>>>>>> A new sample_type called PERF_SAMPLE_PIPELINE_HAZ is introduced.
>>>>>>> If it's set, kernel converts arch specific hazard information
>>>>>>> into generic format:
>>>>>>>
>>>>>>>   struct perf_pipeline_haz_data {
>>>>>>>          /* Instruction/Opcode type: Load, Store, Branch .... */
>>>>>>>          __u8    itype;
>>>>>>>          /* Instruction Cache source */
>>>>>>>          __u8    icache;
>>>>>>>          /* Instruction suffered hazard in pipeline stage */
>>>>>>>          __u8    hazard_stage;
>>>>>>>          /* Hazard reason */
>>>>>>>          __u8    hazard_reason;
>>>>>>>          /* Instruction suffered stall in pipeline stage */
>>>>>>>          __u8    stall_stage;
>>>>>>>          /* Stall reason */
>>>>>>>          __u8    stall_reason;
>>>>>>>          __u16   pad;
>>>>>>>   };
>>>>>>
>>>>>> Kim, does this format indeed work for AMD IBS?
>>>>
>>>> It's not really 1:1, we don't have these separations of stages
>>>> and reasons, for example: we have missed in L2 cache, for example.
>>>> So IBS output is flatter, with more cycle latency figures than
>>>> IBM's AFAICT.
>>>
>>> AMD IBS captures pipeline latency data in case of Fetch sampling like the
>>> Fetch latency, tag to retire latency, completion to retire latency and
>>> so on. Yes, Ops sampling do provide more data on load/store centric
>>> information. But it also captures more detailed data for Branch
>>> instructions.
>>> And we also looked at ARM SPE, which also captures more detailed pipeline
>>> data and latency information.
>>>
>>>>> Personally, I don't like the term hazard. This is too IBM Power
>>>>> specific. We need to find a better term, maybe stall or penalty.
>>>>
>>>> Right, IBS doesn't have a filter to only count stalled or otherwise
>>>> bad events. IBS' PPR descriptions has one occurrence of the
>>>> word stall, and no penalty. The way I read IBS is it's just
>>>> reporting more sample data than just the precise IP: things like
>>>> hits, misses, cycle latencies, addresses, types, etc., so words
>>>> like 'extended', or the 'auxiliary' already used today even
>>>> are more appropriate for IBS, although I'm the last person to
>>>> bikeshed.
>>>
>>> We are thinking of using "pipeline" word instead of Hazard.
>>
>> Hm, the word 'pipeline' occurs 0 times in IBS documentation.
>
> NP. We thought pipeline is generic hw term so we proposed "pipeline"
> word. We are open to term which can be generic enough.
>
>>
>> I realize there are a couple of core pipeline-specific pieces
>> of information coming out of it, but the vast majority
>> are addresses, latencies of various components in the memory
>> hierarchy, and various component hit/miss bits.
>
> Yes. we should capture core pipeline specific details. For example,
> IBS generates Branch unit information (IbsOpData1) and Icache related
> data (IbsFetchCtl) which is something that shouldn't be extended as
> part of perf-mem, IMO.

Sure, IBS Op-side output is more 'perf mem' friendly, and so it
should populate perf_mem_data_src fields, just like POWER9 can:

union perf_mem_data_src {
...
                __u64   mem_rsvd:24,
                        mem_snoopx:2,   /* snoop mode, ext */
                        mem_remote:1,   /* remote */
                        mem_lvl_num:4,  /* memory hierarchy level number */
                        mem_dtlb:7,     /* tlb access */
                        mem_lock:2,     /* lock instr */
                        mem_snoop:5,    /* snoop mode */
                        mem_lvl:14,     /* memory hierarchy level */
                        mem_op:5;       /* type of opcode */

E.g., SIER[LDST] SIER[A_XLATE_SRC] can be used to populate
mem_lvl[_num], SIER_TYPE can be used to populate 'mem_op',
'mem_lock', and the Reload Bus Source Encoding bits can
be used to populate mem_snoop, right?

For IBS, I see PERF_SAMPLE_ADDR and PERF_SAMPLE_PHYS_ADDR can be
used for the ld/st target addresses, too.

>> What's needed here is a vendor-specific extended
>> sample information that all these technologies gather,
>> of which things like e.g., 'L1 TLB cycle latency' we
>> all should have in common.
>
> Yes. We will include fields to capture the latency cycles (like Issue
> latency, Instruction completion latency etc..) along with other pipeline
> details in the proposed structure.

Latency figures are just an example, and from what I
can tell, struct perf_sample_data already has a 'weight' member,
used with PERF_SAMPLE_WEIGHT, that is used by intel-pt to
transfer memory access latency figures. Granted, that's
a bad name given all other vendors don't call latency
'weight'.

I didn't see any latency figures coming out of POWER9,
and do not expect this patch series to implement those
of other vendors, e.g., AMD's IBS; leave each vendor
to amend perf to suit their own h/w output please.

My main point there, however, was that each vendor should
use streamlined record-level code to just copy the data
in the proprietary format that their hardware produces,
and then perf tooling can synthesize the events
from the raw data at report/script/etc. time.

>> I'm not sure why a new PERF_SAMPLE_PIPELINE_HAZ is needed
>> either. Can we use PERF_SAMPLE_AUX instead?
>
> We took a look at PERF_SAMPLE_AUX.
> IIUC, PERF_SAMPLE_AUX is intended when
> large volume of data needs to be captured as part of perf.data without
> frequent PMIs. But proposed type is to address the capture of pipeline

SAMPLE_AUX shouldn't care whether the volume is large, or how frequent
PMIs are, even though it may be used in those environments.

> information on each sample using PMI at periodic intervals. Hence proposing
> PERF_SAMPLE_PIPELINE_HAZ.

And that's fine for any extra bits that POWER9 has to convey
to its users beyond things already represented by other sample
types like PERF_SAMPLE_DATA_SRC, but the capturing of both POWER9
and other vendor e.g., AMD IBS data can be made vendor-independent
at record time by using SAMPLE_AUX, or SAMPLE_RAW even, which is
what IBS currently uses.

>> Take a look at
>> commit 98dcf14d7f9c "perf tools: Add kernel AUX area sampling
>> definitions". The sample identifier can be used to determine
>> which vendor's sampling IP's data is in it, and events can
>> be recorded just by copying the content of the SIER, etc.
>> registers, and then events get synthesized from the aux
>> sample at report/inject/annotate etc. time. This allows
>> for less sample recording overhead, and moves all the vendor
>> specific decoding and common event conversions for userspace
>> to figure out.
>
> When AUX buffer data is structured, tool side changes added to present the
> pipeline data can be re-used.

Not sure I understand: AUX data would be structured on
each vendor's raw h/w register formats.

Thanks,

Kim

>>>>> Also worth considering is the support of ARM SPE (Statistical
>>>>> Profiling Extension) which is their version of IBS.
>>>>> Whatever gets added need to cover all three with no limitations.
>>>>
>>>> I thought Intel's various LBR, PEBS, and PT supported providing
>>>> similar sample data in perf already, like with perf mem/c2c?
>>>
>>> perf-mem is more of data centric in my opinion. It is more towards
>>> memory profiling.
>>> So proposal here is to expose pipeline related
>>> details like stalls and latencies.
>>
>> Like I said, I don't see it that way, I see it as "any particular
>> vendor's event's extended details", and these pipeline details
>> have overlap with existing infrastructure within perf, e.g., L2
>> cache misses.
>>
>> Kim
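
For illustration only, the split argued for above (copy raw register contents at record time, synthesize common events at report time) could look roughly like the sketch below. Everything in it is hypothetical: the vendor IDs, `struct raw_sample`, `struct generic_mem_event`, and especially the bit layouts are made up for the example and do not correspond to SIER, IBS, SPE, or any existing perf structure.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define RAW_MAX_REGS 4

/* Hypothetical vendor tags carried with each raw sample. */
enum raw_vendor { RAW_VENDOR_A = 1, RAW_VENDOR_B = 2 };

/* Record-time payload: vendor tag plus undecoded register words. */
struct raw_sample {
	uint32_t vendor;
	uint32_t nr_regs;
	uint64_t regs[RAW_MAX_REGS];
};

/* Report-time result: a deliberately tiny, made-up common event. */
struct generic_mem_event {
	uint8_t  is_load;
	uint8_t  l2_miss;
	uint16_t latency;	/* cycles */
};

/* Record side: no decoding, just copy what the hardware produced. */
static void record_sample(struct raw_sample *out, uint32_t vendor,
			  const uint64_t *hw_regs, uint32_t nr)
{
	assert(out && hw_regs);
	if (nr > RAW_MAX_REGS)
		nr = RAW_MAX_REGS;
	memset(out, 0, sizeof(*out));
	out->vendor = vendor;
	out->nr_regs = nr;
	memcpy(out->regs, hw_regs, nr * sizeof(hw_regs[0]));
}

/*
 * Report side: vendor-specific decoding into the common event.
 * Returns 0 on success, -1 for an unknown vendor or a short sample.
 * The bit positions below are invented for this sketch.
 */
static int synthesize_event(const struct raw_sample *in,
			    struct generic_mem_event *ev)
{
	assert(in && ev);
	memset(ev, 0, sizeof(*ev));

	switch (in->vendor) {
	case RAW_VENDOR_A:
		/* one register: bit 0 load, bit 1 L2 miss, bits 16-31 latency */
		if (in->nr_regs < 1)
			return -1;
		ev->is_load = (uint8_t)(in->regs[0] & 1);
		ev->l2_miss = (uint8_t)((in->regs[0] >> 1) & 1);
		ev->latency = (uint16_t)((in->regs[0] >> 16) & 0xffff);
		return 0;
	case RAW_VENDOR_B:
		/* two registers: reg0 bit 4 load; reg1 bit 0 L2 miss, bits 32-47 latency */
		if (in->nr_regs < 2)
			return -1;
		ev->is_load = (uint8_t)((in->regs[0] >> 4) & 1);
		ev->l2_miss = (uint8_t)(in->regs[1] & 1);
		ev->latency = (uint16_t)((in->regs[1] >> 32) & 0xffff);
		return 0;
	default:
		return -1;
	}
}
```

The record path stays the same for every vendor, so adding another vendor's sampling hardware would only mean adding a decode case on the tooling side; whether the real payload travels via SAMPLE_AUX or SAMPLE_RAW is left open here.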