On 12/22/2010 07:09 PM, Mike Christie wrote:
On 12/22/2010 05:57 AM, torn5 wrote:
Hello open-iscsi people
I am getting started with iSCSI, and I am currently doing some "reliability" tests.

In particular I would like to be able to reboot the target machine
without the initiators losing data.
Like NFS hard mounts.

[CUT]

These are the errors I see:
[31291.360009] EXT4-fs (sdd1): error count: 10
[31291.360013] EXT4-fs (sdd1): initial error at 1292972264:
ext4_remount:3755
[31291.360015] EXT4-fs (sdd1): last error at 1292976117: ext4_put_super:719
They look harmful...

Could you send the rest of your /var/log/messages? It should have some scsi error code info and block layer error info.

Could you also turn on iscsi eh debugging?


Hello Mike
sorry for the delay in replying, I was running zillions of tests...

The error I reported was spurious; everything was actually fine.
It was a new, misleading ext4 feature: 300 seconds after mount, it reports the last errors ever seen on that filesystem (left over from my earlier tests), and you can clear that error log only by paying money to Ted Ts'o.
Just kidding:
the log would have been cleared if I had a newer version of fsck.ext4.
I was seeing those errors spat out during my disconnection tests and assumed they were caused by the disconnections, but they were just an old log.

The replacement timeout thing works flawlessly; my congratulations on this excellent piece of software, and thanks for all the information.

Just a few more questions:

1- Can I raise the number of SCSI resubmissions from its default of 5, possibly by recompiling the kernel? Do you know where that number is defined, and could you tell me? I grepped the sources, but there are too many candidate values and I'm not sure which is the right one.

2- Wouldn't it be better to keep a separate error count for network errors? I would raise that one. Why should a network error consume retries meant for SCSI errors? Does the SCSI standard mandate treating network failures and disk failures equally? It seems strange/unwise to me...

3- This is a bug report / feature request of sorts:
I wanted to raise replacement_tmo (via sysfs) to a very high value, but it wrapped around. The limit seems to be 2**31/HZ seconds; above that it wraps. Nothing is reported immediately, but at the first network disconnection the timer expires right away, as if it were below zero.
Hence, with HZ=1000 the maximum is about 24 days.

It might sound crazy, but I would like higher values. The thing is, we have (virtual) machines running almost-abandoned services; if one of those freezes for 24 days we might not notice, and then we could start seeing errors and potentially filesystem corruption. Ideally I would like an infinite timeout: a magic value that makes the counter never expire. Since you seem to be using a signed value, and 0 is already taken for no-timeout, -1 could be a good choice imho. Alternatively, use a 64-bit value, or apply HZ differently/later in the computation so it doesn't wrap (at that point one could enter about 68 years of timeout, which would be enough). I tried to look at the source to patch it myself, but that value is passed around a lot and I couldn't really track where it ends up.

Thanks for your help

--
You received this message because you are subscribed to the Google Groups 
"open-iscsi" group.
To post to this group, send email to [email protected].
To unsubscribe from this group, send email to 
[email protected].
For more options, visit this group at 
http://groups.google.com/group/open-iscsi?hl=en.