On 12/22/2010 07:09 PM, Mike Christie wrote:
On 12/22/2010 05:57 AM, torn5 wrote:
Hello open-iscsi people
I am getting started with iSCSI, and I am currently doing some "reliability" tests.

In particular I would like to be able to reboot the target machine
without the initiators losing data.
Like NFS hard mounts.

[CUT]

These are the errors I see:
[31291.360009] EXT4-fs (sdd1): error count: 10
[31291.360013] EXT4-fs (sdd1): initial error at 1292972264:
ext4_remount:3755
[31291.360015] EXT4-fs (sdd1): last error at 1292976117: ext4_put_super:719
They look harmful...

Could you send the rest of your /var/log/messages? It should have some scsi error code info and block layer error info.

Could you also turn on iscsi eh debugging?


Hello Mike
sorry for the delay in replying, I was running zillions of tests...

The error I reported was spurious; everything was actually fine.
It was a new, misleading ext4 feature: 300 seconds after mount, it reports the last errors ever seen on that filesystem (left over from my earlier tests), and you can clear that error log only by paying money to Ted Ts'o.
Just kidding:
the log would have been cleared if I had a newer version of fsck.ext4.
I was seeing those errors spat out during my disconnection tests and assumed they were caused by the disconnections, but they were just an old log.

The replacement timeout thing works flawlessly; my congratulations on this excellent piece of software, and thanks for all the information.

Just a few more questions:

1- Can I raise the number of SCSI resubmissions from its default of 5, possibly by recompiling the kernel? Do you know where that number is defined, and could you tell me? I grepped the sources, but there are too many candidate values and I'm not sure which is the right one.

2- Wouldn't it be better to keep a separate error count for network errors? I would raise that one. Why should a network error consume retries meant for SCSI errors? Does the SCSI standard mandate treating network failures and disk failures equally? It seems strange/unwise to me...

3- This is a bug report / feature request of sorts:
I wanted to raise replacement_tmo (via sysfs) to a very high value, but it wrapped around. The limit seems to be 2**31/HZ seconds; above that it wraps. Nothing is reported immediately, but at the first network disconnection the timer expires right away, as if it were below zero.
Hence, with HZ=1000 the maximum is about 24 days.

It might sound crazy, but I would like higher values. The thing is, we have (virtual) machines running almost-abandoned services; if one of those freezes for 24 days we might not notice, and then we could start seeing errors and potentially filesystem corruption. Ideally I would like an infinite timeout: a magic value that makes the counter never expire. Since you seem to be using a signed value, and 0 is already taken for no-timeout, -1 could be a good choice imho. Alternatively, use a 64-bit value, or apply HZ differently/later in the computation so it doesn't wrap (at that point one could enter about 68 years of timeout, which would be enough). I tried to look at the source to patch it myself, but that value is passed around a lot and I couldn't really track where it ends up.

Thanks for your help

--
You received this message because you are subscribed to the Google Groups 
"open-iscsi" group.
To post to this group, send email to [email protected].
To unsubscribe from this group, send email to 
[email protected].
For more options, visit this group at 
http://groups.google.com/group/open-iscsi?hl=en.