Several times in the past, I've run into remote systems that misbehave wrt terminal window resizing. Basically, at some point they stop responding to resizes. This can be really annoying.

Today I had a system of mine do that to me, and I think I tracked down why. Whenever I've googled for this in the past, I haven't been able to get anywhere with it, so this post is mostly an attempt to record what I found, in hopes that it will help future searchers (maybe even myself a couple of months from now).

I remember that when I'd had this problem before, restarting sshd seemed to clear it up. Not so today. I knew this had something to do with SIGWINCH not being propagated correctly. I knew it worked most of the time; indeed it had been working fine on my machine for months, but it just stopped working today. I had other (remote) systems that seemed to never work, and I didn't really know why.

Today my searching did come up with 'ps s', which shows which signals a process has blocked. Of course, my searching didn't give any clues on how to read it, but I figured it out. Here's what I found:

pescado1:~# ps s -C sshd
  UID   PID  PENDING  BLOCKED  IGNORED   CAUGHT STAT TTY     TIME COMMAND
    0 31900 00000000 08000000 00001000 80014005 Ss   ?       0:14 /usr/sbin/sshd
    0 19689 00000000 08000000 00001000 80012000 Ss   ?     204:13 sshd: root
    0 26486 00000000 08000000 00001000 80012000 Ss   ?       0:00 sshd: [EMAIL PROTECTED]/1

See how those say 08000000 in their BLOCKED columns? That corresponds to SIGWINCH being blocked. How do you know? Run 'kill -l', and it gives you a list of 64 signals. I couldn't really make sense of those 8-character signal lists just yet. Then I went and looked at /proc/$$/status. There they were shown as 16-character strings (with a bunch of leading zeroes). Those smelled a lot like 64-bit numbers. Indeed, it turns out that they are bitmasks, with one bit for each of the 64 signals. So 08000000 in the BLOCKED column means that signal 28 (SIGWINCH) is blocked. [1]

As you can see, since sshd has that signal blocked, so do all of its child processes.
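
To make the reading of those masks concrete, here is a minimal Python sketch that turns one of the hex strings from 'ps s' (or from /proc/$$/status) into signal numbers. The function names decode_sigmask and signal_name are made up for this illustration; the signal names come from whatever platform the script runs on, so they are only meaningful on a system with the same signal numbering as the one that produced the mask.

```python
import signal


def decode_sigmask(mask_hex):
    """Return the signal numbers whose bits are set in a hex signal mask."""
    mask = int(mask_hex, 16)
    # The least significant bit stands for signal 1, the next for signal 2, ...
    return [bit + 1 for bit in range(mask.bit_length()) if (mask >> bit) & 1]


def signal_name(signum):
    """Best-effort name lookup; real-time signals usually have no name here."""
    try:
        return signal.Signals(signum).name
    except ValueError:
        return "signal %d" % signum


if __name__ == "__main__":
    # Values copied from the BLOCKED and IGNORED columns shown above.
    for column, mask_hex in [("BLOCKED", "08000000"), ("IGNORED", "00001000")]:
        numbers = decode_sigmask(mask_hex)
        print(column, mask_hex, numbers, [signal_name(n) for n in numbers])
```

The 16-character strings in /proc/$$/status can be passed in unchanged, since the leading zeroes do not affect the result.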
So even if I run /etc/init.d/ssh restart, the new sshd begins as a child process of my current shell, which also has signal 28 blocked! I guessed that restarting it from the console would work, and it did:

[EMAIL PROTECTED]:~$ ps s -C sshd
  UID   PID  PENDING  BLOCKED  IGNORED   CAUGHT STAT TTY     TIME COMMAND
    0  6117 00000000 00000000 00001000 80006001 Ss   ?       0:00 sshd: davidb [priv]
 5008  6119 00000000 00000000 00001000 80012000 S    ?       0:02 sshd: [EMAIL PROTECTED]/7
    0 28555 00000000 08000000 00001000 80004001 Ss   ?       0:00 sshd: danh [priv]
 5054 28557 00000000 08000000 00001000 80010000 S    ?       0:00 sshd: [EMAIL PROTECTED]/1
    0 28609 00000000 00000000 00001000 00014005 Ss   ?       0:00 /usr/sbin/sshd
    0 28617 00000000 00000000 00001000 80004001 Ss   ?       0:00 sshd: vineet [priv]
 5025 28619 00000000 00000000 00001000 80010000 S    ?       0:00 sshd: [EMAIL PROTECTED]/5

So you can see that the old processes still have their old signal blocks, but the new sshd (started from the console) doesn't, and neither do the new child processes it starts. Sure enough, window resizes in new sessions work just fine.

OK, so why did that happen just now? Well, I had just updated ssh from within aptitude. I decided to watch aptitude more closely. I fired it up, and saw this:

[EMAIL PROTECTED]:~$ ps s -C aptitude
  UID   PID  PENDING  BLOCKED  IGNORED   CAUGHT STAT TTY     TIME COMMAND
    0 19809 00000000 00000000 00001000 88084426 S+   pts/4   0:01 aptitude

Looks fine so far... then I hit 'g'. This time there aren't any packages to install (it just shows me some packages being held back). ps again ... still looks the same. Give aptitude another 'g', and it pops up a dialog: "Downloaded 0B in 0s (0B/s)". ps again:

[EMAIL PROTECTED]:~$ ps s -C aptitude
  UID   PID  PENDING  BLOCKED  IGNORED   CAUGHT STAT TTY     TIME COMMAND
    0 19809 00000000 08000000 00001000 88084426 S+   pts/4   0:01 aptitude

Hey! I tell aptitude to go ahead and continue with package installation, and I keep running ps in my other window. I see that sig 28 stays blocked until it's done.
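
Why a block in aptitude matters for anything aptitude starts: a process's signal mask is inherited across fork() and kept across exec. Here is a minimal Python sketch of that behaviour, with the script playing the part of the parent. The name child_blocked_signals and the CHILD_CODE snippet are invented for this illustration, and it assumes a Unix system with Python 3.7 or later.

```python
import signal
import subprocess
import sys

# Code run by the child: print the numbers of the signals blocked in its own mask.
CHILD_CODE = (
    "import signal; "
    "print(*sorted(int(s) for s in signal.pthread_sigmask(signal.SIG_BLOCK, [])))"
)


def child_blocked_signals():
    """Start a fresh interpreter (fork + exec) and return the signals it finds blocked."""
    result = subprocess.run(
        [sys.executable, "-c", CHILD_CODE],
        capture_output=True, text=True, check=True,
    )
    return [int(n) for n in result.stdout.split()]


if __name__ == "__main__":
    # Play the part of the parent: block SIGWINCH, then start a child.
    signal.pthread_sigmask(signal.SIG_BLOCK, [signal.SIGWINCH])
    inherited = int(signal.SIGWINCH) in child_blocked_signals()
    print("child started with SIGWINCH blocked:", inherited)
```

The child never touches its own mask (blocking an empty set only reads it), so whatever it reports is what it was handed by its parent.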

So any time aptitude upgrades or installs a daemon and invokes its rc script to start it, the daemon starts out with SIGWINCH blocked. This was why it was always hosed on my remote machines at the colo; I had updated their sshd from within aptitude a long time ago. Now I don't really have another way in (short of driving to the colo and plugging in a console), so all ssh sessions on those systems are basically hosed (wrt SIGWINCH processing) until they get rebooted.

So my remaining questions are:

(1) Should this be considered an aptitude bug?

(2) Is there an easy way for me to unblock that signal in a shell, so that I can then restart ssh from within that shell to have it start with a clean slate? Actually, if I could figure out an easy way to do it within a running shell, I'd probably go ahead and put that in the init script, so that future aptitude updates wouldn't be able to re-hose it.

(3) If there is no easy way to do it in a shell, how about from a system call within a process?

(4) If so, is it a bug that sshd doesn't explicitly unblock SIGWINCH when starting up (or at least provide an option to do so)?

All followup comments and questions are welcome.

good times,
Vineet

[1] Here's some more detail on how to read those numbers: Each character is a "hexit" that represents 4 bits. So for example, 00000001 corresponds to just bit #1, and hence signal #1. 80000000 refers to bit 32, and hence signal 32. 80000001 is the sum of those two, and indicates that signals 32 and 1 are both blocked. 08000000 refers to signal 28, since only the 28th bit from the right is a 1, and all the rest are zeroes.

-- 
http://www.doorstop.net/
-- 
#include<stdio.h>
int main() {
  puts("Reader! Think not that \n"
       "technical information  \n"
       "ought not be called speech;");
  return 0;
}
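
On questions (2) and (3): one possible approach, as a minimal sketch only, is a small wrapper that unblocks SIGWINCH in its own process and then execs the real command, so that the command and everything it forks start with the signal unblocked. The function name unblock_and_exec and the script name used below are made up for this illustration, and I have not tried this against sshd or an actual init script.

```python
import os
import signal
import sys


def unblock_and_exec(argv):
    """Unblock SIGWINCH in this process, then replace the process with argv."""
    # The signal mask survives exec, so the new program starts with
    # SIGWINCH unblocked no matter what this process inherited.
    signal.pthread_sigmask(signal.SIG_UNBLOCK, [signal.SIGWINCH])
    os.execvp(argv[0], argv)


if __name__ == "__main__" and len(sys.argv) > 1:
    # Everything after the script name is the command to run.
    unblock_and_exec(sys.argv[1:])
```

Saved under a hypothetical name such as unblock_winch.py, it would be used as 'python3 unblock_winch.py /etc/init.d/ssh restart'. The same idea in C would be a call to sigprocmask() with SIG_UNBLOCK before exec.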