Hi listers,

I have a strange firewall problem with Bacula 2.2.6 running on RHEL4 
clients (2.6.9-67, but it happens on other RHEL4 kernels too) and a CentOS5 
server. The full description of the problem is long and ugly, so I've 
narrowed it down to the following scenario, which is easy (for me) to 
reproduce:

1. One RHEL4 Bacula 2.2.6 client, 192.168.1.25. Relevant iptables in 
this client:

-A RH-Firewall-1-INPUT -p tcp --dport 9101:9103 -j ACCEPT
-A RH-Firewall-1-INPUT -p udp --dport 9101:9103 -j ACCEPT

2. One Bacula 2.2.6 server, 192.168.1.48. Relevant iptables in this server:

-A RH-Firewall-1-INPUT -p tcp --dport 9101:9103 -j ACCEPT
-A RH-Firewall-1-INPUT -p udp --dport 9101:9103 -j ACCEPT

Although there is no 3Com router involved, "Heartbeat Interval" is set to 
60s.

Now, simply start a 23GB restore (full plus a differential) consisting 
of ~70,000 files on the client... everything works as expected for about 
30 minutes, during which the client writes 23GB. Then things start to get 
strange:

1. On the client there is no activity
2. On the server bacula-sd is busy on CPU and I/O most likely searching 
through the 10 x 200GB disk volumes for the differential files to restore.

This "state" will last for another ~30 minutes during which a tcpdump 
will only hear the pings from the heartbeat. Depending on whether the 
firewalls are started or not the end can be one of the following:

No firewall: restore job always ends successfully.
Firewall on: Depending on the positions of the planets either the job 
will succeed THREE HOURS later =:-o or (more likely...) it'll fail with 
a "no route to host" error. A tcpdump started when bacula-sd's job is 
nearing the end clearly shows the culprit:

[... Heartbeat...]

18:32:01.504760 IP server.gbif.org.9103 > client.gbif.org.32776: P 1560794395:1560794427(32) ack 1414218623 win 181 <nop,nop,timestamp 4070418385 22509939>
18:32:01.504801 IP client.gbif.org > server.gbif.org: icmp 92: host client.gbif.org unreachable - admin prohibited
18:32:01.505214 IP server.gbif.org.9103 > client.gbif.org.32776: . 32:1480(1448) ack 1 win 181 <nop,nop,timestamp 4070418386 22509939>
18:32:01.505231 IP client.gbif.org > server.gbif.org: icmp 556: host client.gbif.org unreachable - admin prohibited
18:32:01.505236 IP server.gbif.org.9103 > client.gbif.org.32776: . 1480:2928(1448) ack 1 win 181 <nop,nop,timestamp 4070418386 22509939>
18:32:01.505249 IP client.gbif.org > server.gbif.org: icmp 556: host client.gbif.org unreachable - admin prohibited
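
In case anyone wants to look for the same pattern in their own capture, here is a minimal Python sketch that picks the "admin prohibited" ICMP replies out of tcpdump's text output. It assumes one packet per line (unwrapped) in the format shown above; the function name find_rejections is just something I made up for illustration.

```python
import re

# Matches tcpdump text lines such as:
# "18:32:01.504801 IP client > server: icmp 92: host client unreachable - admin prohibited"
ICMP_REJECT = re.compile(
    r"^(?P<ts>\d{2}:\d{2}:\d{2}\.\d+) IP (?P<src>\S+) > (?P<dst>\S+): "
    r"icmp \d+: host \S+ unreachable - admin prohibited"
)


def find_rejections(lines):
    """Return (timestamp, source, destination) for each admin-prohibited ICMP reply."""
    hits = []
    for line in lines:
        match = ICMP_REJECT.match(line.strip())
        if match:
            hits.append((match.group("ts"), match.group("src"), match.group("dst")))
    return hits


if __name__ == "__main__":
    import sys

    # Usage (hypothetical file name): python find_rejections.py < capture.txt
    for ts, src, dst in find_rejections(sys.stdin):
        print(ts, src, "->", dst)
```

Every hit is a packet that the client's own firewall refused, so the time of the first hit shows when the client stopped accepting the storage daemon's traffic.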

To me it looks like the essence of the problem is that the restore 
session has a long "network idle" period and somehow the RELATED 
mechanism of the firewall no longer works. WHY would this happen? And 
more importantly, isn't this what Heartbeat was supposed to prevent in the 
first place? One more detail: if the client is RHEL5 everything works 
perfectly.
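
Since the suspicion falls on the firewall's connection tracking, one thing that could be compared between the RHEL4 and RHEL5 clients is the conntrack tuning. The sketch below is only a guess at what is worth looking at: it assumes the legacy ip_conntrack sysctl files under /proc/sys/net/ipv4/netfilter (newer kernels use different names), and read_conntrack_settings is a made-up helper, not anything from Bacula.

```python
from pathlib import Path

# Assumption: legacy ip_conntrack sysctl names from 2.6-era kernels.
SETTINGS = (
    "ip_conntrack_tcp_timeout_established",
    "ip_conntrack_tcp_be_liberal",
)


def read_conntrack_settings(base="/proc/sys/net/ipv4/netfilter", names=SETTINGS):
    """Return {name: value} for each setting file that exists under base."""
    values = {}
    for name in names:
        path = Path(base) / name
        if path.is_file():
            values[name] = path.read_text().strip()
    return values


if __name__ == "__main__":
    # Prints nothing if the files are absent (e.g. conntrack not loaded).
    for name, value in read_conntrack_settings().items():
        print(f"{name} = {value}")
```

Running it on both clients and diffing the output would at least show whether the two kernels track the connection with different settings.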

Has anyone seen something like this before? Any ideas will be 
appreciated! :-|




_______________________________________________
Bacula-users mailing list
Bacula-users@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/bacula-users
