On 12/30/2010 10:09 PM, torn5 wrote:
On 12/22/2010 07:09 PM, Mike Christie wrote:
On 12/22/2010 05:57 AM, torn5 wrote:
Hello open-iscsi people
I am approaching iscsi, and I am currently doing some "reliability"
tests.
In particular I would like to be able to reboot the target machine
without the initiators losing data.
Like NFS hard mounts.
[CUT]
These are the errors I see:
[31291.360009] EXT4-fs (sdd1): error count: 10
[31291.360013] EXT4-fs (sdd1): initial error at 1292972264:
ext4_remount:3755
[31291.360015] EXT4-fs (sdd1): last error at 1292976117:
ext4_put_super:719
They look harmful...
Could you send the rest of your /var/log/messages? It should have some
scsi error code info and block layer error info.
Could you also turn on iscsi eh debugging
Hello Mike
sorry for the delay in the reply, I was doing some zillions of tests...
The error I reported was a false alarm; everything was fine.
It was a new, misleading ext4 feature: 300 seconds after mount it
reports the last errors seen on that filesystem (coming from my earlier
tests) and you can clear that error log only by paying money to Ted Ts'o
just kidding
:)
The log would have been cleared if I had a newer version of fsck.ext4.
I was seeing those errors spat out during my disconnection tests and
I thought they were due to the disconnections, but they were just an old
log.
The replacement timeout thing works flawlessly, my congrats on this
excellent piece of software and for all the information.
Just a few more questions:
1- Can I raise the number of "5" resubmissions from SCSI, possibly by
recompiling the kernel? Do you know & could you tell me where that
number is? I grepped the sources but there are too many values and I'm
not sure what is the right one.
It is controlled by the scsi layer and it is hardcoded in
drivers/scsi/sd.h's SD_MAX_RETRIES definition.
2- Wouldn't it be better to have a separate error count for network
errors? I would raise that one. Why should a network error eat retries
from scsi errors? Is it the scsi standard that mandates treating network
failures and disk failures equally? Seems strange/unwise to me...
It is just a generic counter that says if the command does not complete
in 5 retries then fail it. I don't think the value of 5 is based on
anything that is specific to disks or network behavior (FC drivers for
example do something similar for SAN problems). I think it is just based
on past experience that if the IO is retryable but it does not complete
in 5 retries for any reason it is not going to complete.
It seems rare that a command would get 3 network errors and 2 disk
errors, so I do not think it has come up before.
When the iscsi driver was getting submitted a long time ago, there was
code to make the retries configurable, but it got rejected. And, I think
on the linux-scsi list every so often someone sends a patch to make the
retries configurable from sysfs, but it does not get picked up. If I can
find the posts I will send them. I think there is more info in them.
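As a rough illustration of the shared counter described above, here is a
minimal Python sketch. It is a toy model, not the kernel's code: the
function name submit() and the outcome labels are made up for this
example, and the only value taken from the thread is the limit of 5.

```python
MAX_RETRIES = 5  # the value discussed above; everything else here is hypothetical


def submit(outcomes, max_retries=MAX_RETRIES):
    """Replay per-attempt outcomes against one shared retry budget.

    outcomes: iterable of "ok", "net_error" or "disk_error", one per attempt.
    Returns a (result, retries_used) tuple.
    """
    retries = 0
    for outcome in outcomes:
        if outcome == "ok":
            return "completed", retries
        # Any retryable failure consumes one retry, whatever its cause.
        retries += 1
        if retries > max_retries:
            return "failed", max_retries
    # Ran out of recorded outcomes before the command finished or failed.
    return "pending", retries
```

In this model, 3 network errors followed by 2 disk errors still leave room
for one more attempt, while a sixth failure of either kind fails the
command; the counter does not distinguish the two causes.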
3- this is a kinda bug report / feature request:
I wanted to raise replacement_tmo (via sysfs) to a very high value but
it wrapped around. The limit seems to be 2**31/HZ; beyond that it wraps.
It doesn't tell you anything immediately, but at the first network
disconnection it expires immediately, as if it were below zero.
Hence, if I have HZ=1000 the max is about 24 days.
It might sound crazy but I would like higher values. The thing is, we
have (virtual) machines with almost-abandoned services, and if those
freeze for 24 days we might not notice it and then we can start having
errors and potentially filesystem corruption. Ideally I would like an
infinite timeout, like a magic value that makes the counter never
expire. Since you seem to be using a signed value and 0 is already used
for no-timeout, -1
What kernel version are you using? We have exactly that :) If you set it
to -1 then you get your infinite timeout.
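The 2**31/HZ limit reported in item 3 can be checked with a bit of
arithmetic. The Python sketch below assumes the timeout is converted to
jiffies (seconds * HZ) and held in a signed 32-bit value; that is an
assumption consistent with the behaviour described, not something
confirmed from the driver source, and the helper names are made up.

```python
def to_signed32(value):
    """Interpret the low 32 bits of value as a signed 32-bit integer."""
    value &= 0xFFFFFFFF
    return value - 0x100000000 if value >= 0x80000000 else value


def timeout_jiffies(seconds, hz=1000):
    """Timeout in jiffies, assuming (hypothetically) signed 32-bit storage."""
    return to_signed32(seconds * hz)


def max_timeout_days(hz=1000):
    """Largest timeout, in days, that stays below 2**31 jiffies."""
    return (2**31 / hz) / 86400


print(max_timeout_days(1000))    # a little under 25 days
print(timeout_jiffies(2147483))  # still positive
print(timeout_jiffies(2147484))  # negative: wrapped around
```

Under that assumption, with HZ=1000 any value above 2147483 seconds goes
negative, which would match a timer that expires at the first
disconnection.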