Hi Jens, this is just to ask whether you have made any decision about these patches, including deciding not to apply them.
Thanks,
Paolo

On 03 Nov 2015, at 10:01, Paolo Valente <paolo.vale...@unimore.it> wrote:
>
> On 02 Nov 2015, at 17:14, Jens Axboe <ax...@fb.com> wrote:
>
>> On 11/02/2015 07:31 AM, Paolo Valente wrote:
>>> For the Timer IRQ mode (i.e., when command completions are delayed),
>>> there is one timer for each CPU. Each of these timers
>>> . has a completion queue associated with it, containing all the
>>> command completions to be executed when the timer fires;
>>> . is set, and a new completion-to-execute is inserted into its
>>> completion queue, every time the dispatch code for a new command
>>> happens to be executed on the CPU associated with the timer.
>>>
>>> This implies that, if the dispatch of a new command happens to be
>>> executed on a CPU whose timer has already been set, but has not yet
>>> fired, then the timer is set again, to the completion time of the
>>> newly arrived command. When the timer eventually fires, all its
>>> queued completions are executed.
>>>
>>> This way of handling delayed command completions entails the
>>> following problem: if more than one command completion is inserted
>>> into the queue of a timer before the timer fires, then the expiration
>>> time of the timer is moved forward every time one of these
>>> completions is enqueued. As a consequence, only the last completion
>>> enqueued enjoys a correct execution time, while all previous
>>> completions are unjustly delayed until the last completion is
>>> executed (and at that point they are all executed together).
>>>
>>> Specifically, if all the above completions are enqueued at almost
>>> the same time, then the problem is negligible.
>>> At the opposite extreme, if every completion is enqueued a while
>>> after the previous one (in the extreme case, only right before the
>>> timer would have expired), then every enqueued completion, except
>>> for the last one, experiences an inflated delay, proportional to the
>>> number of completions enqueued after it. In the end, commands, and
>>> thus I/O requests, may be completed at an arbitrarily lower rate
>>> than the desired one.
>>>
>>> This commit addresses this issue by replacing per-CPU timers with
>>> per-command timers, i.e., by associating an individual timer with
>>> each command.
>>
>> Functionally the patch looks fine. My only worry is that a timer per
>> command would be an unnecessary slowdown compared to pushing one
>> timer forward. The problem should be fixable by still doing that,
>> just maintaining next-expire instead. Maybe something that would
>> still roughly be precise enough, while still getting some completion
>> batching going? Or maybe that would be slower, and the individual
>> timers are still better.
>>
>> Comments?
>>
>
> I have tried to think about these questions. Unfortunately, I was not
> able to go beyond the following general considerations.
>
> Given that:
> 1) hrtimer_start is very efficient;
> 2) to implement batch-by-batch execution of command completions, more
> complex code would of course be needed in null_blk;
> I think that, to get a perceptible improvement, command completions
> should probably be executed in batches of at least 3 or 4 commands
> each.
>
> In this respect, commands can accumulate in a batch only if the time
> granularity with which delayed commands are guaranteed to be
> executed, i.e., the granularity with which timers eventually fire,
> happens to be coarser than completion_nsec. But this would lead to
> the exact uncontrolled throughput-loss problem addressed by this
> patch, although to a lesser extent. The reason is as follows.
>
> Consider a batch of 3 or 4 commands, executed at a certain timer
> expiration. If the arrival time of these commands is close to the
> timer expiration, then the commands happen to be executed with a
> delay close to the expected one. If, on the other hand, their arrival
> time is far from the timer expiration, then they are completed with a
> larger delay. In the worst case, their delay is inflated by a factor
> proportional to the size of the batch, namely 3 or 4.
>
> In the end, depending on the command arrival pattern, the throughput
> of the emulated device may vary by a factor of 3 or 4. This would be
> a reduced version of the issue addressed by this patch. In fact, with
> the current implementation of delayed completions (per-CPU timers),
> the throughput may vary, in the worst case, by a factor equal to the
> queue depth (64 by default).
>
> Finally, a side note: as for whether the increased efficiency of
> batched delayed completions is worth the additional code complexity,
> as well as the throughput losses/fluctuations, the low delays for
> which this efficiency becomes important match the latencies of new,
> very fast devices (or are at least of the same order). But, exactly
> because of the high speed of these devices, what typically matters
> (AFAIK) in tests with them is understanding whether the rest of the
> system can cope with their speed. And for this type of measurement,
> the best/typical choice is (again, AFAIK) not to delay request
> completions at all in null_blk.
>
> I hope these considerations may be somehow useful.
>
> Thanks,
> Paolo
>
>> --
>> Jens Axboe

--
Paolo Valente
Algogroup
Dipartimento di Fisica, Informatica e Matematica
Via Campi, 213/B
41125 Modena - Italy
homepage: http://algogroup.unimore.it/people/paolo/