Re: [Bug 205398] [regression] [tty] tty_drain() kernel function lacks timeout support it had before

Eugene Grosbein Fri, 18 Dec 2015 08:58:28 -0800

18.12.2015 23:05, Bruce Evans wrote:

On Fri, 18 Dec 2015 a bug system that doesn't want replies wrote:

Revision 181905 by e...@freebsd.org brought the new MPSAFE TTY layer and removed
"drainwait" timeout support. Now applications working with a serial port can hang
forever in the close() system call:

It brought many other bugs.  About 20 more related to draining.

Some of the other bugs accidentally ameliorate this one.  The tty layer
never waits long enough for the last few characters to drain (though
I finished fixing this for sio in 1996).  So it takes a large buffer
to possibly give an endless wait.  Flow control must be on for the
wait to be endless.  Flow control is also broken...

There is a hack for last-close that is supposed to give a hard-coded timeout
of 1 second.  Not sure why this doesn't work for you.  My quick fix that
restores the timeout uses slightly different logic where this hack was.


I made a mistake (now corrected) while filing the PR: my system is 9.3-STABLE
and not 10.2-STABLE. It has no "leaving" case hack.

The timeout is also a hack (breaks POSIX conformance), but at least the
user can control it and it doesn't default to a too small value.  The
old default of 300 seconds was a bit too large, but I kept it.  My systems
have always changed this to 180 seconds in /etc/rc.d.  I set it to 1
second per-device only transiently.

- an application opens /dev/cuau0 in non-blocking I/O mode and tries to detect
a GSM gateway there by writing commands like ATZ, ATE1 etc. to the device;
- the device may be dead (lost power, broken, disconnected etc.) and does not
answer back;


Old versions also had a hack by me that breaks waiting in last-close if
the device is in non-blocking mode.

If the device is really disconnected, then the tty should be in a zombie
state and should not wait.  I think this still works.  CLOCAL or lack of
modem signals may break detection of last-close.


The device does not get disconnected in the process; it has not been connected
since the moment of open().

Did you have CRTSCTS flow control enabled?  This is probably the main
source of hangs.  The RTS and CTS signals are not ignored in CLOCAL mode,
so flow control should be invoked when they go away as the device goes
away.


It has both CRTSCTS flow control and CLOCAL enabled, and
I'd like to keep them both enabled and working.

- the application times out waiting for an answer and closes the device with close();
- the tty layer tries to drain output "forever", until a signal arrives.


Perhaps the hard-coded 1 second timeout only works for close() in exit().
So it helps more for sloppy applications that exit without waiting for
their data to go out.

Applications that do the above are still sloppy.  POSIX specifies waiting
"forever" again to drain in close().  A non-buggy application would do:

      write();
      // set up timeout for draining
      tcdrain();
      // when timeout expires, try to recover
      // when recovery is impossible, clean up and exit
      tcflush();        // this is a critical step in the cleanup
      // set up timeout for closing, just in case there is a kernel bug
      close();        // now it can't block unless there was a kernel bug

gnokii (comms/gnokii) suffers from this problem.

Please re-implement the tunable timeout and the TIOCSDRAINWAIT ioctl the kernel
had before.


This is mostly fixed in my version.  I started to cut out the patches,
but they were too entwined with other fixes.  Here is the part that
replaces the hard-coded 1 second timeout:

X diff -c2 ./kern/tty.c~ ./kern/tty.c
X *** ./kern/tty.c~    Thu Mar 19 18:23:08 2015
X --- ./kern/tty.c    Sat Aug  8 11:40:23 2015
X ***************
X *** 133,155 ****
X           return (0);
X
X !     while (ttyoutq_bytesused(&tp->t_outq) > 0) {
X           ttydevsw_outwakeup(tp);
X           /* Could be handled synchronously. */
X           bytesused = ttyoutq_bytesused(&tp->t_outq);
X !         if (bytesused == 0)
X               return (0);
X
X           /* Wait for data to be drained. */
X !         if (leaving) {
X               revokecnt = tp->t_revokecnt;
X !             error = tty_timedwait(tp, &tp->t_outwait, hz);
X               switch (error) {
X               case ERESTART:
X                   if (revokecnt != tp->t_revokecnt)
X                       error = 0;
X                   break;
X               case EWOULDBLOCK:
X !                 if (ttyoutq_bytesused(&tp->t_outq) < bytesused)
X                       error = 0;
X                   break;
X               }
X --- 196,225 ----
X           return (0);
X
X !     while (ttyoutq_bytesused(&tp->t_outq) != 0 || tp->t_flags & TS_BUSY) {


Strange diff format... Will patch(1) apply this with all those X's?

Thank you for the answer, anyway! I'll try to understand and test the patches next week.

For a quick fix, try turning off flow control (both hardware and software)
in last-close.  This should limit the wait.  Only large buffers or small
speeds take very long to drain if draining is not blocked completely by
flow control.  I use small speeds to test bugs in this area.  E.g., at 50
bps, a 4K buffer takes 800 seconds to drain; at 1 bps, it takes 40960
seconds to drain.  This shows how broken a hard-coded timeout of 1 or
even 300 seconds is.  Also, how broken an application that doesn't do
its own draining and error handling is.
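Both points above can be sketched in C. This is a rough illustration under stated assumptions, not code from the thread: the helper names are hypothetical, CRTSCTS is a BSD/Linux extension rather than POSIX (hence the #ifdef), and the drain-time estimate assumes 10 bits on the wire per character (8N1 framing), which the text does not state. With that assumption a 4096-byte buffer needs about 819 seconds at 50 bps, in line with the rounded 800 above, and 40960 seconds at 1 bps.

```c
#include <termios.h>

/* Clear hardware (CRTSCTS) and software (IXON/IXOFF) flow control bits. */
void strip_flow_control(struct termios *t)
{
#ifdef CRTSCTS                      /* extension, not in POSIX */
    t->c_cflag &= ~(tcflag_t)CRTSCTS;
#endif
    t->c_iflag &= ~(tcflag_t)(IXON | IXOFF);
}

/*
 * Hypothetical helper: apply the change to an open tty right before
 * the final close().  Returns 0 on success, -1 with errno set on error.
 */
int disable_flow_control(int fd)
{
    struct termios t;

    if (tcgetattr(fd, &t) == -1)
        return -1;
    strip_flow_control(&t);
    return tcsetattr(fd, TCSANOW, &t);
}

/*
 * Seconds needed to shift out `bytes` at `bps`, assuming
 * `bits_per_char` bits on the wire per character and no flow-control
 * stalls.
 */
double drain_seconds(unsigned long bytes, unsigned bits_per_char, double bps)
{
    return (double)bytes * bits_per_char / bps;
}
```

Calling disable_flow_control(fd) just before close(fd) bounds the wait by the drain_seconds() estimate instead of leaving it at the mercy of a dead peer holding CTS low.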


Well, I just use the comms/gnokii port to talk to my GSM gateways via serial port
to send SMS with one-time security codes to my customers and occasionally
informational SMS. If one GSM gateway fails, I'd like gnokii not to hang
so my script can proceed with the backup gateway. In the meantime, I've ported
timeout(1) to 9.3-STABLE and it kills gnokii if it hangs for too long. But that's ugly.

_______________________________________________
freebsd-bugs@freebsd.org mailing list
https://lists.freebsd.org/mailman/listinfo/freebsd-bugs
To unsubscribe, send any mail to "freebsd-bugs-unsubscr...@freebsd.org"
