On 23/09/2024 21:43, Adrián Larumbe wrote:
> Hi Steve,
> 
> On 23.09.2024 09:55, Steven Price wrote:
>> On 20/09/2024 23:36, Adrián Larumbe wrote:
>>> Hi Steve, thanks for the review.
>>
>> Hi Adrián,
>>
>>> I've applied all of your suggestions for the next patch series revision,
>>> so I'll only be answering your question about the
>>> calc_profiling_ringbuf_num_slots function further down below.
>>>
>>
>> [...]
>>
>>>>> @@ -3003,6 +3190,34 @@ static const struct drm_sched_backend_ops panthor_queue_sched_ops = {
>>>>>   .free_job = queue_free_job,
>>>>>  };
>>>>>  
>>>>> +static u32 calc_profiling_ringbuf_num_slots(struct panthor_device *ptdev,
>>>>> +                                            u32 cs_ringbuf_size)
>>>>> +{
>>>>> + u32 min_profiled_job_instrs = U32_MAX;
>>>>> + u32 last_flag = fls(PANTHOR_DEVICE_PROFILING_ALL);
>>>>> +
>>>>> + /*
>>>>> +  * We want to calculate the minimum size of a profiled job's CS,
>>>>> +  * since profiled jobs need additional instructions for the sampling
>>>>> +  * of performance metrics and might therefore take up more slots in
>>>>> +  * the queue's ringbuffer. This means we might not need as many job
>>>>> +  * slots for keeping track of their profiling information. What we
>>>>> +  * need is the maximum number of slots we should allocate to this end,
>>>>> +  * which matches the maximum number of profiled jobs we can place
>>>>> +  * simultaneously in the queue's ring buffer.
>>>>> +  * That has to be calculated separately for every single job profiling
>>>>> +  * flag, but not in the case where job profiling is disabled, since
>>>>> +  * unprofiled jobs don't need to keep track of this at all.
>>>>> +  */
>>>>> + for (u32 i = 0; i < last_flag; i++) {
>>>>> +         if (BIT(i) & PANTHOR_DEVICE_PROFILING_ALL)
>>>>> +                 min_profiled_job_instrs =
>>>>> +                         min(min_profiled_job_instrs, calc_job_credits(BIT(i)));
>>>>> + }
>>>>> +
>>>>> + return DIV_ROUND_UP(cs_ringbuf_size, min_profiled_job_instrs * sizeof(u64));
>>>>> +}
>>>>
>>>> I may be missing something, but is there a situation where this is
>>>> different to calc_job_credits(0)? AFAICT the infrastructure you've added
>>>> can only add extra instructions to the no-flags case - whereas this
>>>> implies you're thinking that instructions may also be removed (or 
>>>> replaced).
>>>>
>>>> Steve
>>>
>>> Since we create a separate kernel BO to hold the profiling information
>>> slots, we need one that can accommodate as many slots as the maximum
>>> number of profiled jobs we can insert simultaneously into the queue's
>>> ring buffer. Because profiled jobs always take more instructions than
>>> unprofiled ones, we would usually need fewer slots than the number of
>>> unprofiled jobs we could insert at once in the ring buffer.
>>>
>>> Because we represent profiling metrics with a bit mask, we need to test
>>> the size of the CS for every single metric enabled in isolation, since
>>> enabling more than one will always mean a bigger CS, and therefore fewer
>>> jobs tracked at once in the queue's ring buffer.
>>>
>>> In our case, calling calc_job_credits(0) would simply tell us the number
>>> of instructions we need for a normal job with no profiled features
>>> enabled, which always requires fewer instructions than profiled ones,
>>> and would therefore mean more slots in the profiling info kernel BO. But
>>> we don't need to keep track of profiling numbers for unprofiled jobs, so
>>> there's no point in calculating this number.
>>>
>>> At first I was simply allocating a profiling info kernel BO as big as
>>> the number of simultaneous unprofiled job slots in the ring queue, but
>>> Boris pointed out that since queue ringbuffers can be as big as 2 GiB, a
>>> lot of this memory would be wasted: profiled jobs always require more
>>> slots because they hold more instructions, so fewer profiling slots are
>>> needed in said kernel BO.
>>>
>>> The value of this approach will eventually show if we decide to keep
>>> track of more profiling metrics, since this code won't have to change at
>>> all, other than adding new profiling flags in the
>>> panthor_device_profiling_flags enum.
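>>>
>>> As a hypothetical example with made-up numbers: for a 32 KiB ring buffer
>>> where the smallest profiled job's CS takes 32 instructions (256 bytes),
>>> we'd need DIV_ROUND_UP(32768, 32 * 8) = 128 profiling slots, whereas
>>> sizing the BO after 16-instruction unprofiled jobs would have meant 256.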
>>
>> Thanks for the detailed explanation. I think what I was missing is that
>> the loop is checking each bit flag independently and *not* checking
>> calc_job_credits(0).
>>
>> The check for (BIT(i) & PANTHOR_DEVICE_PROFILING_ALL) is probably what
>> confused me - that should be completely redundant. Or at least we need
>> something more intelligent if we have profiling bits which are not
>> mutually compatible.
> 
> I thought of an alternative that would only test bits that are actually
> part of the mask:
> 
> static u32 calc_profiling_ringbuf_num_slots(struct panthor_device *ptdev,
>                                      u32 cs_ringbuf_size)
> {
>       u32 min_profiled_job_instrs = U32_MAX;
>       u32 profiling_mask = PANTHOR_DEVICE_PROFILING_ALL;
> 
>       while (profiling_mask) {
>               u32 i = ffs(profiling_mask) - 1;
>               profiling_mask &= ~BIT(i);
>               min_profiled_job_instrs =
>                       min(min_profiled_job_instrs, calc_job_credits(BIT(i)));
>       }
> 
>       return DIV_ROUND_UP(cs_ringbuf_size, min_profiled_job_instrs * sizeof(u64));
> }
> 
> However, I don't think this would be more efficient: ffs() probably finds
> the first set bit through register shifts, and I'd guess that takes
> somewhat longer than iterating over every single bit up to the last one,
> even if each is also matched against the whole mask, just in case future
> additions of performance metrics leave some of the lower-significance
> bits untouched.

Efficiency isn't very important here - we're not on a fast path, so it's
more about ensuring the code is readable. I don't think the above is
more readable than the original for loop.

> Regarding your question about mutual compatibility, I don't think that is an
> issue here, because we're testing bits in isolation. If in the future we find
> out that some of the values we're profiling cannot be sampled at once, we can
> add that logic to the sysfs knob handler, to make sure UM cannot set forbidden
> profiling masks.

My comment about compatibility is because in the original above you were
calculating the top bit of PANTHOR_DEVICE_PROFILING_ALL:

> u32 last_flag = fls(PANTHOR_DEVICE_PROFILING_ALL);

then looping between 0 and that bit:

> for (u32 i = 0; i < last_flag; i++) {

So the test:

> if (BIT(i) & PANTHOR_DEVICE_PROFILING_ALL)

would only fail if PANTHOR_DEVICE_PROFILING_ALL had gaps in the bits
that it set. The only reason I can think for that to be true in the
future is if there is some sort of incompatibility - e.g. maybe there's
an old and new way of doing some form of profiling with the old way
being kept for backwards compatibility. But I suspect if/when that is
required we'll need to revisit this function anyway. So that 'if'
statement seems completely redundant (it's trivially always true).
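
To illustrate (a sketch only, assuming PANTHOR_DEVICE_PROFILING_ALL remains
a contiguous mask), the loop could then be reduced to:

        for (u32 i = 0; i < last_flag; i++)
                min_profiled_job_instrs =
                        min(min_profiled_job_instrs, calc_job_credits(BIT(i)));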

Steve

>> I'm also not entirely sure that the amount of RAM saved is significant,
>> but you've already written the code so we might as well have the saving ;)
> 
> I think this was more evident before Boris suggested we reduce the basic
> slot size to that of a single cache line, because back then the smallest
> profiled job might have taken twice as many ringbuffer slots as an
> unprofiled one. In that case, we would need a BO half as big for holding
> the sampled data (provided the smallest profiled job's CS extended over
> the 16-instruction boundary).
> I still think this is a good idea, so that in the future we don't need to
> worry about adjusting the code that prepares the right boilerplate CS,
> since it'll only be a matter of adding new instructions inside
> prepare_job_instrs().
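> 
> As a made-up illustration of the saving: with a 1 MiB ring buffer and
> 16-instruction (128-byte) unprofiled slots, sizing the BO after unprofiled
> jobs would mean 8192 profiling slots, while if the smallest profiled job
> took 32 instructions, DIV_ROUND_UP(1048576, 32 * 8) = 4096 slots suffice.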
> 
>> Thanks,
>> Steve
>>
>>> Regards,
>>> Adrian
>>>
>>>>> +
>>>>>  static struct panthor_queue *
>>>>>  group_create_queue(struct panthor_group *group,
>>>>>              const struct drm_panthor_queue_create *args)
>>>>> @@ -3056,9 +3271,35 @@ group_create_queue(struct panthor_group *group,
>>>>>           goto err_free_queue;
>>>>>   }
>>>>>  
>>>>> + queue->profiling.slot_count =
>>>>> +         calc_profiling_ringbuf_num_slots(group->ptdev, args->ringbuf_size);
>>>>> +
>>>>> + queue->profiling.slots =
>>>>> +         panthor_kernel_bo_create(group->ptdev, group->vm,
>>>>> +                                  queue->profiling.slot_count *
>>>>> +                                  sizeof(struct panthor_job_profiling_data),
>>>>> +                                  DRM_PANTHOR_BO_NO_MMAP,
>>>>> +                                  DRM_PANTHOR_VM_BIND_OP_MAP_NOEXEC |
>>>>> +                                  DRM_PANTHOR_VM_BIND_OP_MAP_UNCACHED,
>>>>> +                                  PANTHOR_VM_KERNEL_AUTO_VA);
>>>>> +
>>>>> + if (IS_ERR(queue->profiling.slots)) {
>>>>> +         ret = PTR_ERR(queue->profiling.slots);
>>>>> +         goto err_free_queue;
>>>>> + }
>>>>> +
>>>>> + ret = panthor_kernel_bo_vmap(queue->profiling.slots);
>>>>> + if (ret)
>>>>> +         goto err_free_queue;
>>>>> +
>>>>> + /*
>>>>> +  * Credit limit argument tells us the total number of instructions
>>>>> +  * across all CS slots in the ringbuffer, with some jobs requiring
>>>>> +  * twice as many as others, depending on their profiling status.
>>>>> +  */
>>>>>   ret = drm_sched_init(&queue->scheduler, &panthor_queue_sched_ops,
>>>>>                        group->ptdev->scheduler->wq, 1,
>>>>> -                      args->ringbuf_size / (NUM_INSTRS_PER_SLOT * sizeof(u64)),
>>>>> +                      args->ringbuf_size / sizeof(u64),
>>>>>                        0, msecs_to_jiffies(JOB_TIMEOUT_MS),
>>>>>                        group->ptdev->reset.wq,
>>>>>                        NULL, "panthor-queue", group->ptdev->base.dev);
>>>>> @@ -3354,6 +3595,7 @@ panthor_job_create(struct panthor_file *pfile,
>>>>>  {
>>>>>   struct panthor_group_pool *gpool = pfile->groups;
>>>>>   struct panthor_job *job;
>>>>> + u32 credits;
>>>>>   int ret;
>>>>>  
>>>>>   if (qsubmit->pad)
>>>>> @@ -3407,9 +3649,16 @@ panthor_job_create(struct panthor_file *pfile,
>>>>>           }
>>>>>   }
>>>>>  
>>>>> + job->profiling.mask = pfile->ptdev->profile_mask;
>>>>> + credits = calc_job_credits(job->profiling.mask);
>>>>> + if (credits == 0) {
>>>>> +         ret = -EINVAL;
>>>>> +         goto err_put_job;
>>>>> + }
>>>>> +
>>>>>   ret = drm_sched_job_init(&job->base,
>>>>>                            &job->group->queues[job->queue_idx]->entity,
>>>>> -                          1, job->group);
>>>>> +                          credits, job->group);
>>>>>   if (ret)
>>>>>           goto err_put_job;
>>>>>  
>>>
> 
> 
> Adrian Larumbe