On 10/23/2018 10:33 AM, Paolo Bonzini wrote:
On 22/10/2018 23:28, George Kennedy wrote:
As you suggested I moved the loading of "s->resel_dsp" down to the "Wait
Reselect" case. The address of the Reselection Scripts, though, is contained
in "s->dsp - 8" and not in s->dnad.
Are you sure? s->dsp - 8 should be the address of the Wait Reselect
instruction itself. But you're right that s->dnad is the address at
which to jump "if the LSI53C895A is selected before being reselected"
(as the spec puts it) so the reselection DSP should be just s->dsp.
See the first 25 lines of lsi_execute_script(), where dsp is bumped up by 8
("s->dsp += 8"), so it needs to be adjusted back to what it was.
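The pointer arithmetic under discussion can be sketched in standalone C. This is only a toy model: ToyLsi, toy_fetch() and toy_current_insn_addr() are hypothetical names and are not part of hw/scsi/lsi53c895a.c. Only the "s->dsp += 8" step is taken from the thread. The sketch shows that, after the bump, "dsp - 8" is the address of the instruction being executed and "dsp" is the address of the one that follows it. It does not settle which of the two should be recorded for reselection.

```c
#include <assert.h>
#include <stdint.h>

/* Toy model only: the field name echoes the thread, but this is not
 * QEMU's LSIState. */
typedef struct {
    uint32_t dsp; /* SCRIPTS pointer */
} ToyLsi;

/* Fetch step: note where the instruction lives, then advance dsp past
 * its two 32-bit words, as "s->dsp += 8" does in the thread. */
static uint32_t toy_fetch(ToyLsi *s)
{
    uint32_t insn_addr = s->dsp;
    s->dsp += 8;
    return insn_addr;
}

/* While an instruction is executing, dsp already points past it, so
 * its own address is recovered by undoing the bump. */
static uint32_t toy_current_insn_addr(const ToyLsi *s)
{
    assert(s->dsp >= 8); /* only meaningful after a fetch */
    return s->dsp - 8;
}
```

With dsp starting at 0x1000, toy_fetch() returns 0x1000 and leaves dsp at 0x1008, so toy_current_insn_addr() gives back 0x1000.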
The reason the timeout is needed is that under heavy IO some pending commands
stay on the pending queue longer than the 30 second command timeout set by the
Linux upper layer SCSI driver (sym53c8xx). When command timeouts occur, the
upper layer SCSI driver sends SCSI Abort messages to remove the timed out
commands. The command timeouts occur because, under heavy IO,
lsi_reselect() in qemu "hw/scsi/lsi53c895a.c" is not called before the
upper layer SCSI driver's 30 second command timeout expires.
If lsi_reselect() were called more frequently, the command timeout problem would
probably not occur. There are a number of places where lsi_reselect() is
supposed to get called (e.g. at the end of lsi_update_irq()), but the only place
that I have observed lsi_reselect() being called is from lsi_execute_script()
when lsi_wait_reselect() is called because of a SCRIPT "Wait Reselect" IO
Instruction.
Reselection should only happen when the target needs access to the bus,
which is when I/O has finished. There should be no need for such a
deadline; reselection should already be happening at the right time when
lsi_transfer_data calls lsi_queue_req, which in turn calls lsi_reselect.
Agree that it should happen as you describe, but under heavy IO (fio),
it does not.
When it works as expected the check for "s->waiting == 1" (Wait Reselect
instruction has been issued) in lsi_transfer_data() is true. Under heavy
IO, s->waiting is not "1" for an extended period of time and
lsi_queue_req() does not get called, which leaves any pending commands
"stuck" on the queue because lsi_reselect() does not get called.
The Scripts are the only place where lsi_wait_reselect() is called and
the only place where "s->waiting = 1" is set. So, the delay in getting a
Scripts Wait Reselect command is the root cause of the problem.
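The stall described above can be modelled in a few lines of standalone C. This is a deliberately simplified sketch under stated assumptions: ToyHba and the toy_ functions are hypothetical, and the real checks in lsi_transfer_data() and lsi_queue_req() involve more state than a single flag. It only illustrates that completions arriving while "waiting" is not 1 stay parked until the Scripts issue another Wait Reselect.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model only: not QEMU's LSIState. */
typedef struct {
    int waiting;    /* 1 once a Wait Reselect instruction has been issued */
    int pending;    /* completed requests still parked on the queue */
    int reselected; /* requests handed back through reselection */
} ToyHba;

/* A target has finished I/O. It is reselected immediately only if the
 * Scripts are already sitting in Wait Reselect; otherwise it is parked. */
static bool toy_transfer_data(ToyHba *s)
{
    if (s->waiting == 1) {
        s->waiting = 0;
        s->reselected++;
        return true;
    }
    s->pending++;
    return false;
}

/* The Scripts reach a Wait Reselect instruction: service one parked
 * request if there is one, otherwise start waiting. */
static void toy_wait_reselect(ToyHba *s)
{
    if (s->pending > 0) {
        s->pending--;
        s->reselected++;
    } else {
        s->waiting = 1;
    }
}
```

In this model the pending count can only shrink inside toy_wait_reselect(), so any delay before the next Wait Reselect translates directly into time spent on the queue.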
The check in lsi_transfer_data() where it decides whether to call
lsi_queue_req() is probably the preferred place to add a fix, but I have
not been able to come up with a fix here that does not run into problems
because of Script state.
Maybe many of the places that call lsi_irq_on_rsl(s) also need to check
s->want_resel?
I've added debug to all the places where lsi_reselect() should be
called, but under heavy IO lsi_reselect() does not get called for a
period of time exceeding the upper layer's 30 second command timeout,
hence the need for the patch which injects a Scripts Wait Reselect IO
command.
My test setup consists of 5 remote iSCSI disks. Here is the fio write job
file that shows the problem:
[global]
bs=256k
iodepth=2
direct=1
ioengine=libaio
randrepeat=0
group_reporting
time_based
runtime=60
numjobs=40
name=test
rw=write
[job1]
filename=/dev/sda
filename=/dev/sdb
filename=/dev/sdc
filename=/dev/sdd
filename=/dev/sde
I am not strongly attached to my proposed fix. If an alternative fix can
be suggested, I'd be more than willing to try that.
Thank you,
George
Paolo