Hi Y'all, We have two RHEL 6 servers, one is running bacula-dir and bacula-sd (with a lot of disk storage for the backups), and the other is the client we want to back up from running bacula-fd.
Both servers are behind different firewalls, and each has a public IP address that is configured for 1:1 NAT to the internal server. I can telnet from the director to the client on port 9102 and connect OK, and the client can telnet to the storage daemon on the other server on port 9103 and connect OK. So, I think the network path is working. What appears to happen is that we start the backup job, and it connects and begins transferring and then almost immediately stops and then times out. It transfers like 800 bytes and seems to just sit there. This is what it looks like on the different status reports from bconsole on the directory node: DIRECTOR STATUS: *status Status available for: 1: Director 2: Storage 3: Client 4: All Select daemon type for status (1-4): 1 bacula-dir Version: 5.0.0 (26 January 2010) x86_64-unknown-linux-gnu redhat (Final) Daemon started 15-Jun-12 09:39, 0 Jobs run since started. Heap: heap=135,168 smbytes=61,022 max_bytes=61,278 bufs=247 max_bufs=253 Scheduled Jobs: Level Type Pri Scheduled Name Volume =================================================================================== Incremental Backup 10 16-Jun-12 02:00 hubcon02 Default0002 ==== Running Jobs: Console connected at 15-Jun-12 09:52 JobId Level Name Status ====================================================================== 19 Full hubcon02.2012-06-15_09.53.03_03 is running ==== Terminated Jobs: JobId Level Files Bytes Status Finished Name ==================================================================== 4 Full 0 0 Error 13-Jun-12 14:41 hubcon02 5 Full 0 0 Error 13-Jun-12 15:40 hubcon02 8 Full 0 0 Error 13-Jun-12 22:10 hubcon02 10 Full 10 173.5 K Error 13-Jun-12 22:56 hubcon02 11 Full 0 0 Error 13-Jun-12 23:06 hubcon02 13 Full 0 0 Error 14-Jun-12 01:30 hubcon02 14 Full 10 107.9 K Error 14-Jun-12 02:15 hubcon02 15 Full 0 0 Error 15-Jun-12 04:11 hubcon02 16 Full 10 108.2 K Cancel 15-Jun-12 08:05 hubcon02 17 Full 10 108.2 K Error 15-Jun-12 08:22 hubcon02 ==== * STORAGE STATUS: *status Status available for: 1: Director 2: Storage 3: Client 4: All Select daemon type for status (1-4): 2 The defined Storage resources are: 1: storage-hubcon01 2: storage-hubcon02 3: storage-app01 Select Storage resource (1-3): 2 Connecting to Storage daemon storage-hubcon02 at 128.114.61.132:9103 bacula-sd Version: 5.0.0 (26 January 2010) x86_64-unknown-linux-gnu redhat (Final) Daemon started 15-Jun-12 09:24, 0 Jobs run since started. Heap: heap=131,072 smbytes=304,519 max_bytes=304,713 bufs=149 max_bufs=151 Sizes: boffset_t=8 size_t=8 int32_t=4 int64_t=8 Running Jobs: Writing: Full Backup job hubcon02 JobId=19 Volume="Default0002" pool="Default" device="backup-hubcon02" (/data/backup/backup-hubcon02) spooling=0 despooling=0 despool_wait=0 Files=4 Bytes=52,038 Bytes/sec=634 FDReadSeqNo=32 in_msg=23 out_msg=5 fd=8 ==== Jobs waiting to reserve a drive: ==== Terminated Jobs: JobId Level Files Bytes Status Finished Name =================================================================== 8 Full 0 0 Cancel 13-Jun-12 22:10 hubcon02 9 Full 2 631 Cancel 13-Jun-12 22:32 hubcon02 10 Full 4 51.80 K Cancel 13-Jun-12 23:05 hubcon02 11 Full 0 0 Error 13-Jun-12 23:06 hubcon02 12 Full 0 0 Cancel 13-Jun-12 23:18 hubcon02 13 Full 2 631 Error 14-Jun-12 01:19 hubcon02 14 Full 2 631 Cancel 14-Jun-12 02:15 hubcon02 15 Full 2 860 Error 15-Jun-12 04:00 hubcon02 16 Full 2 860 Cancel 15-Jun-12 08:05 hubcon02 17 Full 2 860 Cancel 15-Jun-12 08:22 hubcon02 ==== Device status: Device "backup-hubcon01" (/data/backup/backup-hubcon01) is not open. Device "backup-hubcon02" (/data/backup/backup-hubcon02) is mounted with: Volume: Default0002 Pool: Default Media type: File Total Bytes=215 Blocks=0 Bytes/block=215 Positioned at File=0 Block=215 Device "backup-app01" (/data/backup/backup-app01) is not open. ==== Used Volume status: Default0002 on device "backup-hubcon02" (/data/backup/backup-hubcon02) Reader=0 writers=2 devres=0 volinuse=1 ==== ==== * CLIENT STATUS: *status Status available for: 1: Director 2: Storage 3: Client 4: All Select daemon type for status (1-4): 3 The defined Client resources are: 1: hubcon01-fd 2: hubcon02-fd 3: app01-fd Select Client (File daemon) resource (1-3): 2 Connecting to Client hubcon02-fd at bacula-fd.mydomain.com:9102 hubcon02-fd Version: 5.0.0 (26 January 2010) x86_64-unknown-linux-gnu redhat Enterprise 6.0 Daemon started 15-Jun-12 09:26, 1 Job run since started. Heap: heap=0 smbytes=167,245 max_bytes=167,392 bufs=116 max_bufs=117 Sizeof: boffset_t=8 size_t=8 debug=0 trace=0 Running Jobs: JobId 19 Job hubcon02.2012-06-15_09.53.03_03 is running. Full Backup Job started: 15-Jun-12 09:53 Files=10 Bytes=173,753 Bytes/sec=1,368 Errors=0 Files Examined=10 Processing file: /root/.gstreamer-0.10/registry.x86_64.bin SDReadSeqNo=5 fd=5 Director connected at: 15-Jun-12 09:55 ==== Terminated Jobs: JobId Level Files Bytes Status Finished Name ====================================================================== 6 Full 10 107.9 K Error 13-Jun-12 21:46 hubcon02 9 Full 10 107.9 K Cancel 13-Jun-12 22:32 hubcon02 10 Full 10 173.5 K Error 13-Jun-12 22:55 hubcon02 13 Full 10 107.9 K Error 13-Jun-12 23:35 hubcon02 12 Full 0 0 Error 14-Jun-12 01:19 hubcon02 14 Full 10 107.9 K Error 14-Jun-12 02:15 hubcon02 15 Full 10 108.2 K Error 15-Jun-12 02:16 hubcon02 16 Full 10 108.2 K Error 15-Jun-12 08:05 hubcon02 17 Full 10 108.2 K Error 15-Jun-12 08:22 hubcon02 18 Full 10 108.2 K Error 15-Jun-12 09:43 hubcon02 ==== * So, the storage says it transferred 52,038 bytes almost immediately after the job started and then transferred no more, I check for a while after. Sometimes it tranfers only 800 bytes then just waits. Eventually the job times out with: 15-Jun 02:00 bacula-dir JobId 15: No prior Full backup Job record found. 15-Jun 02:00 bacula-dir JobId 15: No prior or suitable Full backup found in catalog. Doing FULL backup. 15-Jun 02:00 bacula-dir JobId 15: Start Backup JobId 15, Job=hubcon02.2012-06-15_02.00.00_10 15-Jun 02:00 bacula-dir JobId 15: Using Device "backup-hubcon02" 15-Jun 02:00 bacula-sd JobId 15: Volume "Default0002" previously written, moving to end of data. 15-Jun 02:00 bacula-sd JobId 15: Ready to append to end of Volume "Default0002" size=215 15-Jun 04:00 bacula-sd JobId 15: Fatal error: append.c:242 Network error reading from FD. ERR=Connection reset by peer 15-Jun 04:00 bacula-sd JobId 15: Job write elapsed time = 02:00:00, Transfer rate = 0 Bytes/second 15-Jun 04:00 bacula-sd JobId 15: Error: bsock.c:518 Read error from client:111.21.21.21:36643: ERR=Connection reset by peer 15-Jun 04:11 bacula-dir JobId 15: Fatal error: Network error with FD during Backup: ERR=Connection timed out 15-Jun 04:11 bacula-dir JobId 15: Fatal error: No Job status returned from FD. 15-Jun 04:11 bacula-dir JobId 15: Error: Bacula cghub-bacula-sc-dir 5.0.0 (26Jan10): 15-Jun-2012 04:11:21 It *does* seem like the FD can connect back to the DIR and SD, so I don't know why it would stop transferring immediately after starting and then eventually time out. I've confirmed my firewalls are not timing out the connections. I also have "Heartbeat Interval = 60" in the SD, DIR and FD config files. But the reason I don't think it's a firewall timeout is because it almost immediately stops transferring after only like 1 second. Anyone have any ideas as to what is happening? Thanks in advance!! -erich ------------------------------------------------------------------------------ Live Security Virtual Conference Exclusive live event will cover all the ways today's security and threat landscape has changed and how IT managers can respond. Discussions will include endpoint security, mobile security and the latest in malware threats. http://www.accelacomm.com/jaw/sfrnl04242012/114/50122263/ _______________________________________________ Bacula-users mailing list Bacula-users@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/bacula-users