I'm seeing what looks to me like totally illogical behavior with my IPv4 networking. Can someone please help me isolate the problem better?
I have at least EIGHT servers with the same symptom. All are running Oracle "Unbreakable Enterprise Kernel 2". Oracle numbers this kernel 2.6.39.*, but it is "based on the 3.0.16 kernel". I don't know exactly what patches might have been applied.

The symptom I see is: I'm SSH'ed into the server from my desk on another network. All is well. Then either (1) SSH freezes, or (2) I exit SSH, and can't SSH to it again. Then I ping the server from my desk. It FAILS. I ping the server from a second machine on my desk (same network). It works.

If I keep pinging from my desktop, where the SSH just failed, it will NEVER get a response. I've let it ping for DAYS. But if I stop pinging for 5 minutes or so, it'll work just fine again. While things are "hosed", I am able to ping and ssh from my second desktop to the server just fine. If I SSH to the server, it CAN ping my desktop, but it CANNOT traceroute to it.

If I leave the ping going (and failing), and go to the server and "ip route flush cache", the pings start working immediately.

I can get the problem from other desktops on other networks, but I have never seen it from another server on the same network.

It gets stranger. Here are some commands run on the server, while the pings from my desktop are failing. The failing pings are coming from 192.168.118.22. The machine right next to that one is .23, and it works fine.

I have ONE NIC in the box, and I have no reason to think it isn't configured properly.

# ifconfig -a
eth0      Link encap:Ethernet  HWaddr 00:50:56:9A:00:17
          inet addr:172.16.2.95  Bcast:172.16.255.255  Mask:255.255.0.0
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:246266059 errors:0 dropped:85001 overruns:0 frame:0
          TX packets:290982046 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:70745127855 (65.8 GiB)  TX bytes:27490797799 (25.6 GiB)

lo        Link encap:Local Loopback
          inet addr:127.0.0.1  Mask:255.0.0.0
          UP LOOPBACK RUNNING  MTU:16436  Metric:1
          RX packets:258548668 errors:0 dropped:0 overruns:0 frame:0
          TX packets:258548668 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:226377171068 (210.8 GiB)  TX bytes:226377171068 (210.8 GiB)

The server can ping my desktop just fine:

# ping 192.168.118.22
PING 192.168.118.22 (192.168.118.22) 56(84) bytes of data.
64 bytes from 192.168.118.22: icmp_seq=1 ttl=127 time=0.827 ms
64 bytes from 192.168.118.22: icmp_seq=2 ttl=127 time=0.739 ms
64 bytes from 192.168.118.22: icmp_seq=3 ttl=127 time=0.725 ms

But a traceroute to the same destination says "network is down":

# traceroute 192.168.118.22
traceroute to 192.168.118.22 (192.168.118.22), 30 hops max, 40 byte packets
send: Network is down

A syscall trace of traceroute shows the sendto() call getting a ENETDOWN response:

socket(PF_INET, SOCK_DGRAM, IPPROTO_IP) = 3
setsockopt(3, SOL_IP, IP_MTU_DISCOVER, [0], 4) = 0
setsockopt(3, SOL_SOCKET, SO_TIMESTAMP, [1], 4) = 0
fcntl(3, F_SETFL, O_RDONLY|O_NONBLOCK) = 0
setsockopt(3, SOL_IP, IP_TTL, [1], 4) = 0
setsockopt(3, SOL_IP, IP_RECVERR, [1], 4) = 0
connect(3, {sa_family=AF_INET, sin_port=htons(33434), sin_addr=inet_addr("192.168.118.22")}, 28) = 0
sendto(3, "@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_"..., 40, 0, NULL, 0) = -1 ENETDOWN (Network is down)

Yet traceroute (and ping) to a machine on the same network is fine:

# traceroute 192.168.118.23
traceroute to 192.168.118.23 (192.168.118.23), 30 hops max, 40 byte packets
 1  172.16.16.253 (172.16.16.253)  1.304 ms  1.614 ms  1.886 ms
 2  192.168.118.23 (192.168.118.23)  0.521 ms  0.566 ms  0.562 ms

I have a default route, and no other routes defined:

# netstat -nr
Kernel IP routing table
Destination     Gateway         Genmask         Flags   MSS Window  irtt Iface
0.0.0.0         172.16.0.5      0.0.0.0         UG        0 0          0 eth0
169.254.0.0     0.0.0.0         255.255.0.0     U         0 0          0 eth0
172.16.0.0      0.0.0.0         255.255.0.0     U         0 0          0 eth0

Here are my route cache entries for the network I'm trying to talk to:

# netstat -nrC|grep 192.168.118
172.16.2.95     192.168.118.22  172.16.70.101          1500 0        239 eth0
192.168.118.23  172.16.2.95     172.16.2.95     l     16436 0          0 lo
172.16.2.95     192.168.118.23  172.16.70.101          1500 0          0 eth0
192.168.118.22  172.16.2.95     172.16.2.95     l     16436 0          0 lo
172.16.2.95     192.168.118.22  172.16.70.101          1500 0        239 eth0
172.16.2.95     192.168.118.23  172.16.70.101          1500 0          0 eth0
172.16.2.95     192.168.118.22  172.16.70.101          1500 0        239 eth0
172.16.2.95     192.168.118.23  172.16.70.101          1500 0          0 eth0
172.16.2.95     192.168.118.23  172.16.70.101          1500 0          0 eth0

And finally, tcpdump shows that the pings from my desktop ARE arriving. They are simply not being replied to:

# tcpdump -np host 192.168.118.22
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on eth0, link-type EN10MB (Ethernet), capture size 96 bytes
10:20:48.950240 IP 192.168.118.22 > 172.16.2.95: ICMP echo request, id 2, seq 35155, length 40
10:20:54.956584 IP 192.168.118.22 > 172.16.2.95: ICMP echo request, id 2, seq 35158, length 40
10:21:00.959048 IP 192.168.118.22 > 172.16.2.95: ICMP echo request, id 2, seq 35161, length 40
10:21:06.964326 IP 192.168.118.22 > 172.16.2.95: ICMP echo request, id 2, seq 35164, length 40

If you could PLEASE advise me on where to go from here, I would greatly appreciate it. I can't imagine what would cause these symptoms.
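
In case it helps anyone compare the route cache entries above, here is a rough Python sketch that tallies them per (source, destination, gateway). It is only an illustration: the function name tally_route_cache is made up for this example, and it assumes the first three whitespace-separated columns of "netstat -nrC" output are source, destination, and gateway, which is how the listing above appears to be laid out. The sample lines are copied from that listing.

```python
from collections import Counter

def tally_route_cache(lines, needle):
    """Count cache entries per (source, destination, gateway).

    Assumes the first three whitespace-separated fields of each line are
    source, destination, and gateway (an assumption about the layout of
    `netstat -nrC` output). Only lines containing `needle` are counted.
    """
    counts = Counter()
    for line in lines:
        fields = line.split()
        # Skip blank lines, short lines, and lines for other networks.
        if len(fields) < 3 or needle not in line:
            continue
        counts[tuple(fields[:3])] += 1
    return counts

# Route cache entries copied from the listing in this message.
SAMPLE = """\
172.16.2.95 192.168.118.22 172.16.70.101 1500 0 239 eth0
192.168.118.23 172.16.2.95 172.16.2.95 l 16436 0 0 lo
172.16.2.95 192.168.118.23 172.16.70.101 1500 0 0 eth0
192.168.118.22 172.16.2.95 172.16.2.95 l 16436 0 0 lo
172.16.2.95 192.168.118.22 172.16.70.101 1500 0 239 eth0
172.16.2.95 192.168.118.23 172.16.70.101 1500 0 0 eth0
172.16.2.95 192.168.118.22 172.16.70.101 1500 0 239 eth0
172.16.2.95 192.168.118.23 172.16.70.101 1500 0 0 eth0
172.16.2.95 192.168.118.23 172.16.70.101 1500 0 0 eth0
""".splitlines()

for (src, dst, gw), n in sorted(tally_route_cache(SAMPLE, "192.168.118").items()):
    print(f"{src} -> {dst} via {gw}: {n} entries")
```

The tally only groups what the listing already shows; it does not explain why the entries for .22 and .23 behave differently.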

Here is the ver_linux output:

Linux jidlam01.acbl.net 2.6.39-200.29.1.el5uek #1 SMP Fri Jul 6 08:01:33 EDT 2012 x86_64 x86_64 x86_64 GNU/Linux

Gnu C                  4.1.2
Gnu make               3.81
binutils               2.17.50.0.6 8.3
util-linux             2.13-pre7
mount                  2.13-pre7
module-init-tools      3.3-pre2
e2fsprogs              1.39
pcmciautils            014
quota-tools            3.13.
PPP                    2.4.4
Linux C Library        2.5
Dynamic linker (ldd)   2.5
Procps                 3.2.7
Net-tools              1.60
Kbd                    1.12
Sh-utils               5.97
udev                   095
wireless-tools         28
Modules Loaded         autofs4 hidp rfcomm bluetooth rfkill lockd sunrpc be2iscsi iscsi_boot_sysfs ib_iser rdma_cm ib_cm iw_cm ib_sa ib_mad ib_core ib_addr iscsi_tcp bnx2i cnic uio ipv6 cxgb3i libcxgbi cxgb3 mdio libiscsi_tcp libiscsi scsi_transport_iscsi video sbs sbshc hed acpi_memhotplug acpi_ipmi ipmi_msghandler lp sg sr_mod cdrom snd_seq_dummy serio_raw e1000 vmw_balloon snd_seq_oss snd_seq_midi_event snd_seq snd_seq_device snd_pcm_oss snd_mixer_oss snd_pcm snd_timer snd soundcore snd_page_alloc pcspkr parport_pc i2c_piix4 i2c_core parport floppy pata_acpi ata_generic dm_snapshot dm_zero dm_mirror dm_region_hash dm_log dm_mod ata_piix shpchp mptspi mptscsih mptbase scsi_transport_spi sd_mod crc_t10dif ext3 jbd mbcache

Terry Phelps
American Commercial Lines
Jeffersonville, IN