hoy added inline comments.

================
Comment at: llvm/include/llvm/IR/PseudoProbe.h:41
   //  [18:3] - probe id
-  //  [25:19] - reserved
+  //  [25:19] - probe distribution factor
   //  [28:26] - probe type, see PseudoProbeType
----------------
wmi wrote:
> hoy wrote:
> > wmi wrote:
> > > hoy wrote:
> > > > wmi wrote:
> > > > > hoy wrote:
> > > > > > hoy wrote:
> > > > > > > wmi wrote:
> > > > > > > > The bits in discriminator is a scare resource. Have you 
> > > > > > > > considered using less bits to represent probe distribution 
> > > > > > > > factor? I guess it is possible that using a little more coarse 
> > > > > > > > grain distribution factor won't affect performance.
> > > > > > > That's a good point. We are using seven bits to represent [0, 
> > > > > > > 100] so that integral numbers can be distinguished. Yes, we could 
> > > > > > > use fewer bits to represent, say 4 bits to represent only even 
> > > > > > > numbers. We could also not use any bits here but instead use the 
> > > > > > > distribution factor of the outer block probes when the 
> > > > > > > competition of those bits are high. I can do an experiment to see 
> > > > > > > how well that works.
> > > > > > On a second thought, using the distribution factor of block probes 
> > > > > > for call probe may not work well since a callsite may be surrounded 
> > > > > > by more than one block probes. 
> > > > > > 
> > > > > > We could use also fewer bits like 6 bits to encode even numbers in 
> > > > > > the range [0, 100], or 5 bits to encoding multiples of 3 in [0, 
> > > > > > 100]. I did a profile quality measurement with the even number 
> > > > > > encoding. It's OK overall except for two SPEC benchmarks. I guess 
> > > > > > it's a trade-off we'll have to take when there's a competition on 
> > > > > > those bits. 
> > > > > Could you elaborate a little bit about the case that a callsite is 
> > > > > surrounded by more than one block probe? Is it because bb merge like 
> > > > > in cfg simplification?
> > > > Yes, block merge in cfg simplification is a good example. Inlining can 
> > > > also end up with callee code and caller code in one block. Jump 
> > > > threading or other cfg optimizations that convert a conditional jump 
> > > > into an unconditional jump can result in block merge too.
> > > > 
> > > > So far our way to track block weight for blocks with multiple probes is 
> > > > to take the maximum count out of those probes. When it comes to 
> > > > tracking callsite count, it is handy and accurate to attach a dedicated 
> > > > distribution factor for each individual call. For example, when a call 
> > > > is inlined, the inlinee's probes will be cloned into the caller, and 
> > > > they will be prorated based on the callsite's dedicated distribution 
> > > > factor.
> > > Actually, I think we may be able to extend Discriminator and 
> > > PseudoProbeDwarfDiscriminator. To emit Discriminator into Dwarf, we need 
> > > to follow Dwarf standard about how many bits Discrminator is going to 
> > > occupy. But inside compiler, Discriminator is represented as MetaData so 
> > > it hasn't to be 32bits. For example, we can extend Discriminator MetaData 
> > > to be 64bits or even larger and specify only lower 32bits will be 
> > > actually emitted into Dwarf section. For intermediate information like 
> > > distribution factors, we can put it into the higher bits. 
> > That's a good idea, I like that. Actually we thought about that int the 
> > past and our concern was about memory cost since the discriminator filed in 
> > `DILexicalBlockFile` metadata is not optional. It is probably OK for pseudo 
> > probe since discriminators are only used for callsites. It might be a 
> > problem with -fdebug-info-for-profiling where discriminators can be used 
> > more often. 
> > 
> > It sounds to me extending the size of discriminator is desirable for pseudo 
> > probes and potentially FS-AFDO. It might be worth evaluating the cost at 
> > some time. What do you think?
> Yes, it is worth evaluating the cost. It is only about intermediate data in 
> compiler and it won't affect the binary and profile output, therefore it 
> won't introduce backward compatibility issue. I think it is up to you to 
> choose whether to evaluate it now or later.  
Sounds good. Will do a measurement for both -fpseudo-probe-for-profiling and 
-fdebug-info-for-profiling later.


================
Comment at: llvm/lib/IR/PseudoProbe.cpp:65
+void setProbeDistributionFactor(Instruction &Inst, float Factor) {
+  assert(Factor <= 1);
+  if (auto *II = dyn_cast<PseudoProbeInst>(&Inst)) {
----------------
wmi wrote:
> Add assertion message.
Will do.


================
Comment at: llvm/lib/Transforms/IPO/SampleProfile.cpp:363
   const FunctionSamples *CalleeSamples;
   uint64_t CallsiteCount;
+  // Call site distribution factor to prorate the profile samples for a
----------------
wmi wrote:
> CallsiteCount will be the count before being prorated or after if 
> CallsiteDistribution is not 1.0?
It is the count after prorated. The prorated count will be used to guide 
inlining. For example, if a callsite is duplicated in LTO prelink, then in LTO 
postlink the two copies will get their own distribution factors and their 
prorated counts are used to decide if they should be inlined independently.


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D93264/new/

https://reviews.llvm.org/D93264

_______________________________________________
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits

Reply via email to