Hi,

I have a couple of (W2K8) servers on a different subnet. The network config is correct as far as I can see: routes/gateways are added on both subnets, I can ping both ways, I can telnet into 9102 on the clients from the director/SD machine and into 9103 on the SD/director machine from the clients, and status client works from bconsole.
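For anyone who wants to repeat the port checks above without telnet, here is a minimal sketch in Python. The function names port_open and check_all are made up for illustration, and the host names in the comments are the placeholders used in this post. Ports 9102 and 9103 are the ones mentioned above. Note that it only shows whether a new TCP connection can be opened at that moment, nothing more.

```python
import socket


def port_open(host, port, timeout=5.0):
    """Return True if a TCP connection to (host, port) can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution failures.
        return False


def check_all(targets, timeout=5.0):
    """Map each (host, port) pair to the result of port_open()."""
    return {(host, port): port_open(host, port, timeout) for host, port in targets}


# Example calls (placeholder host names, not run automatically):
#   on the director/SD machine: check_all([("HOSTNAME1", 9102), ("HOSTNAME2", 9102)])
#   on a client:                check_all([("DIRHOSTNAME", 9103)])
```

A result of True for every pair corresponds to the successful telnet tests described above.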
The backup commences and the volume files start getting written. bconsole, however, only reports up to the following lines:

    18-May 16:38 DIRHOSTNAME-sd JobId 32487: Job write elapsed time = 00:37:39, Transfer rate = 5.191 M Bytes/second
    18-May 16:40 DIRHOSTNAME-sd JobId 32486: Job write elapsed time = 00:39:38, Transfer rate = 5.241 M Bytes/second

Normally you get a bunch of VSS lines after that and the summary with an OK. The /var/bacula/working/log file does not contain the above two lines, only a bunch of the intermediate failures on junction points. In fact it freezes in mid-line at some point (the first line of other output continues there without a newline in between; other director output continues fine afterwards).

status dir reports:

    Running Jobs:
    Console connected at 18-May-10 17:41
     JobId Level   Name                              Status
    ======================================================================
     32486 Full    HOSTNAME1.2010-05-18_16.00.56_18  is running
     32487 Full    HOSTNAME2.2010-05-18_16.01.03_19  is running

The resource monitor on the hosts does not report network activity (i.e. an open connection) to the SD/director, except when I do a status client on it (which works), and it seems like the (5.0.2) client thinks it has successfully finished the job:

    *st client=HOSTNAME1-fd
    Connecting to Client HOSTNAME1-fd at 1.2.3.4:9102

    HOSTNAME1-fd Version: 5.0.2 (28 April 2010)  VSS Linux Cross-compile Win64
    Daemon started 18-May-10 15:54, 1 Job run since started.
     Heap: heap=0 smbytes=131,202 max_bytes=292,179 bufs=89 max_bufs=274
     Sizeof: boffset_t=8 size_t=8 debug=0 trace=1

    Running Jobs:
    Director connected at: 18-May-10 17:45
    No Jobs running.
    ====

    Terminated Jobs:
     JobId  Level    Files      Bytes   Status   Finished         Name
    ======================================================================
     32486  Full     86,470    12.44 G  OK       18-May-10 16:40  HOSTNAME1
    ====
    *

HOSTNAME2 produces similar output.

Somewhat later they error out:

    18-May 18:04 DIRHOSTNAME-dir JobId 32487: Fatal error: Network error with FD during Backup: ERR=Connection reset by peer
    18-May 18:04 DIRHOSTNAME-dir JobId 32487: Fatal error: No Job status returned from FD.
    18-May 18:04 DIRHOSTNAME-dir JobId 32487: Error: Bacula DIRHOSTNAME-dir 5.0.2 (28Apr10): 18-May-2010 18:04:13
      Build OS:               i686-pc-linux-gnu debian 5.0.4
      JobId:                  32487
      Job:                    HOSTNAME2.2010-05-18_16.01.03_19
      Backup Level:           Full (upgraded from Incremental)
      Client:                 "HOSTNAME2-fd" 5.0.2 (28Apr10) Linux,Cross-compile,Win64
      FileSet:                "Windows HOSTNAME2 set" 2010-05-18 16:01:03
      Pool:                   "Pool_HOSTNAME2" (From Job resource)
      Catalog:                "MyCatalog" (From Client resource)
      Storage:                "HOSTNAME2_storage" (From Job resource)
      Scheduled time:         18-May-2010 16:01:01
      Start time:             18-May-2010 16:01:05
      End time:               18-May-2010 18:04:13
      Elapsed time:           2 hours 3 mins 8 secs
      Priority:               10
      FD Files Written:       0
      SD Files Written:       85,253
      FD Bytes Written:       0 (0 B)
      SD Bytes Written:       11,726,994,434 (11.72 GB)
      Rate:                   0.0 KB/s
      Software Compression:   None
      VSS:                    no
      Encryption:             no
      Accurate:               no
      Volume name(s):         Vol_HOSTNAME2_0001
      Volume Session Id:      9
      Volume Session Time:    1274189949
      Last Volume Bytes:      11,738,369,181 (11.73 GB)
      Non-fatal FD errors:    0
      SD Errors:              0
      FD termination status:  Error
      SD termination status:  OK
      Termination:            *** Backup Error ***

Same for HOSTNAME1. Interestingly, its error came right after HOSTNAME2's, so the order is apparently reversed only due to timing, and both jobs fail within the same minute (18:04):

    18-May 18:04 DIRHOSTNAME-dir JobId 32486: Fatal error: Network error with FD during Backup: ERR=Connection reset by peer
    18-May 18:04 DIRHOSTNAME-dir JobId 32486: Fatal error: No Job status returned from FD.
    18-May 18:04 DIRHOSTNAME-dir JobId 32486: Error: Bacula DIRHOSTNAME-dir 5.0.2 (28Apr10): 18-May-2010 18:04:31
      Build OS:               i686-pc-linux-gnu debian 5.0.4
      JobId:                  32486
      Job:                    HOSTNAME1.2010-05-18_16.00.56_18
      Backup Level:           Full (upgraded from Incremental)
      Client:                 "HOSTNAME1-fd" 5.0.2 (28Apr10) Linux,Cross-compile,Win64
      FileSet:                "Windows HOSTNAME1 set" 2010-05-18 16:00:56
      Pool:                   "Pool_HOSTNAME1" (From Job resource)
      Catalog:                "MyCatalog" (From Client resource)
      Storage:                "HOSTNAME1_storage" (From Job resource)
      Scheduled time:         18-May-2010 16:00:55
      Start time:             18-May-2010 16:00:58
      End time:               18-May-2010 18:04:31
      Elapsed time:           2 hours 3 mins 33 secs
      Priority:               10
      FD Files Written:       0
      SD Files Written:       86,470
      FD Bytes Written:       0 (0 B)
      SD Bytes Written:       12,464,036,220 (12.46 GB)
      Rate:                   0.0 KB/s
      Software Compression:   None
      VSS:                    no
      Encryption:             no
      Accurate:               no
      Volume name(s):         Vol_HOSTNAME1_0001
      Volume Session Id:      8
      Volume Session Time:    1274189949
      Last Volume Bytes:      12,475,995,727 (12.47 GB)
      Non-fatal FD errors:    0
      SD Errors:              0
      FD termination status:  Error
      SD termination status:  OK
      Termination:            *** Backup Error ***

After that the director finally sees the jobs as errored out instead of still running (but the clients report OK in the termination status). The /var/bacula/working/log now contains the failure lines as well. Again interestingly, it continues mid-sentence where it left off before.

Is this a networking issue where some "I'm done" packet was lost or held up? If so, does that message go to another port (I don't think so), or does it use a special protocol or form, so that a specific network issue could block it but not everything else?