Hi Jens, this is just to ask whether you have made any decision about these patches, including deciding not to apply them.
Thanks,
Paolo

On 03 Nov 2015, at 10:01, Paolo Valente <paolo.vale...@unimore.it> wrote:
>
> On 02 Nov 2015, at 17:14, Jens Axboe <ax...@fb.com> wrote:
>
>> On 11/02/2015 07:31 AM, Paolo Valente wrote:
>>> For the Timer IRQ mode (i.e., when command completions are delayed),
>>> there is one timer for each CPU. Each of these timers
>>> . has a completion queue associated with it, containing all the
>>> command completions to be executed when the timer fires;
>>> . is set, and a new completion-to-execute is inserted into its
>>> completion queue, every time the dispatch code for a new command
>>> happens to be executed on the CPU associated with the timer.
>>>
>>> This implies that, if the dispatch of a new command happens to be
>>> executed on a CPU whose timer has already been set, but has not yet
>>> fired, then the timer is set again, to the completion time of the
>>> newly arrived command. When the timer eventually fires, all its
>>> queued completions are executed.
>>>
>>> This way of handling delayed command completions entails the
>>> following problem: if more than one command completion is inserted
>>> into the queue of a timer before the timer fires, then the expiration
>>> time of the timer is moved forward every time one of these
>>> completions is enqueued. As a consequence, only the last completion
>>> enqueued enjoys a correct execution time, while all previous
>>> completions are unjustly delayed until the last completion is
>>> executed (and at that point they are all executed together).
>>>
>>> Specifically, if all the above completions are enqueued at almost
>>> the same time, then the problem is negligible.
>>> At the opposite extreme, if every completion is enqueued a while
>>> after the previous one (in the extreme case, only right before the
>>> timer would have expired), then every enqueued completion, except
>>> for the last one, experiences an inflated delay, proportional to the
>>> number of completions enqueued after it. In the end, commands, and
>>> thus I/O requests, may be completed at an arbitrarily lower rate
>>> than the desired one.
>>>
>>> This commit addresses this issue by replacing per-CPU timers with
>>> per-command timers, i.e., by associating an individual timer with
>>> each command.
>>
>> Functionally the patch looks fine. My only worry is that a timer per
>> command would be an unnecessary slowdown compared to pushing one
>> timer forward. The problem should be fixable by still doing that,
>> just maintaining next-expire instead. Maybe something that would
>> still roughly be precise enough, while still getting some completion
>> batching going? Or maybe that would be slower, and the individual
>> timers are still better.
>>
>> Comments?
>>
>
> I have tried to think about these questions. Unfortunately, I was not
> able to go beyond the following general considerations.
>
> Given that:
> 1) hrtimer_start is very efficient;
> 2) to implement batch-by-batch execution of command completions, more
> complex code would of course be needed in null_blk;
> I think that, to get a perceptible improvement, command completions
> should probably be executed in batches of at least 3 or 4 commands
> each.
>
> In this respect, commands can accumulate in a batch only if the time
> granularity with which delayed commands are guaranteed to be
> executed, i.e., the granularity with which timers eventually fire,
> happens to be coarser than completion_nsec. But this would lead to
> the exact uncontrolled throughput-loss problem addressed by this
> patch, although to a lesser extent. The reason is as follows.
>
> Consider a batch of 3 or 4 commands, executed at a certain timer
> expiration. If the arrival time of these commands is close to the
> timer expiration, then the commands happen to be executed with a
> delay close to the expected one. If, on the other hand, their arrival
> time is far from the timer expiration, then they are completed with a
> larger delay. In the worst case, their delay is inflated by a factor
> proportional to the size of the batch, namely 3 or 4.
>
> In the end, depending on the command arrival pattern, the throughput
> of the emulated device may vary by a factor of 3 or 4. This would be
> a reduced version of the issue addressed by this patch. In fact, with
> the current implementation of delayed completions (per-CPU timers),
> the throughput may vary, in the worst case, by a factor equal to the
> queue depth (64 by default).
>
> Finally, a side note: as for whether the increased efficiency of
> batched delayed completions is worth the additional code complexity,
> as well as the throughput losses/fluctuations, the low delays for
> which this efficiency becomes important match the latencies of new,
> very fast devices (or are at least of the same order). But, exactly
> because of the high speed of these devices, what typically matters
> (AFAIK) in tests with them is understanding whether the rest of the
> system can cope with their speed. And for this type of measurement,
> the best/typical choice is (again, AFAIK) not to delay request
> completions at all in null_blk.
>
> I hope these considerations may be somehow useful.
>
> Thanks,
> Paolo
>
>> --
>> Jens Axboe

--
Paolo Valente
Algogroup
Dipartimento di Fisica, Informatica e Matematica
Via Campi, 213/B
41125 Modena - Italy
homepage: http://algogroup.unimore.it/people/paolo/