On Dec 9, 2008, at 2:36 PM, Jason Dixon wrote: > On Tue, Dec 09, 2008 at 09:26:38PM +0000, Allan Black wrote: >> Jason Dixon wrote: >>> Alas, I spoke too soon. The CatalogBackup job failed again last >>> night, >>> usual symptoms. >> >> OK. need to find out what the FD is doing. I would recommend: >> >> truss -o filename -f -a -e -v all -w 2 -p <FD pid> >> >> Is it possible to run the catalog backup during the day, by hand? >> That way you could avoid leaving the truss running (and collecting >> data) all night. > > I've run it 6 times today, no failures yet. Frustrating. Here are > the > results showing all the previous failures, then the successes today. > > -bash-3.2$ echo 'list jobs' | sudo /opt/bacula/sbin/i386/bconsole | > grep BackupCatalog | grep '| f ' > | 60 | BackupCatalog | 2008-11-08 00:36:12 | B | F > | 1 | 965,869,568 | f | > | 116 | BackupCatalog | 2008-11-11 23:14:05 | B | F > | 1 | 614,727,680 | f | > | 286 | BackupCatalog | 2008-11-22 23:14:28 | B | F > | 1 | 5,278,400,512 | f | > | 298 | BackupCatalog | 2008-11-23 23:14:32 | B | F > | 1 | 4,723,965,952 | f | > | 336 | BackupCatalog | 2008-11-26 23:14:30 | B | F > | 1 | 3,979,280,384 | f | > | 361 | BackupCatalog | 2008-11-28 23:16:11 | B | F > | 1 | 2,101,936,128 | f | > | 385 | BackupCatalog | 2008-11-30 23:13:21 | B | F > | 1 | 3,863,216,128 | f | > > -bash-3.2$ echo 'list jobs' | sudo /opt/bacula/sbin/i386/bconsole | > grep BackupCatalog | grep '2008-12-09' > | 486 | BackupCatalog | 2008-12-09 16:46:51 | B | F > | 1 | 3,183,911,677 | T | > | 487 | BackupCatalog | 2008-12-09 16:55:42 | B | F > | 1 | 3,183,912,220 | T | > | 488 | BackupCatalog | 2008-12-09 17:03:22 | B | F > | 1 | 3,183,912,724 | T | > | 489 | BackupCatalog | 2008-12-09 17:18:03 | B | F > | 1 | 3,183,913,233 | T | > | 490 | BackupCatalog | 2008-12-09 17:23:26 | B | F > | 1 | 3,183,913,740 | T | > | 491 | BackupCatalog | 2008-12-09 17:29:53 | B | F > | 1 | 3,183,914,246 | T | > > > -- > Jason Dixon > OmniTI Computer Consulting, Inc. > jdi...@omniti.com > 443.325.1357 x.241 >
I'm seeing similar problems to the one Jason described. However, in addition to the 'Packet Size too big' failure, I see a variety of other error messages as well. FD, SD, and dir are all running on the same machine, disk to disk backup. Some examples of the failures: 20-Dec 05:01 server-sd JobId 328: Fatal error: append.c:190 File index from FD not positive or sequential 20-Dec 05:01 server-sd JobId 328: Job write elapsed time = 00:18:22, Transfer rate = 10.48 M bytes/second 20-Dec 05:01 server-fd JobId 328: Fatal error: backup.c:1003 Network send error to SD. ERR=Broken pipe 20-Dec 05:01 server-fd JobId 328: Error: bsock.c:341 Write error sending -1 bytes to Storage daemon:192.168.1.123:9103: ERR=Broken pipe 20-Dec 05:01 server-dir JobId 328: Error: Bacula server-dir 2.5.16- beta (20Oct08): 20-Dec-2008 05:01:23 Build OS: i386-pc-solaris2.11 solaris 5.11 05-Dec 13:30 server-fd JobId 275: Fatal error: backup.c:957 Network send error to SD. ERR=I/O error 05-Dec 13:30 server-sd JobId 275: Fatal error: append.c:259 Network error on data channel. ERR=No data available 05-Dec 13:30 server-sd JobId 275: Job write elapsed time = 01:36:26, Transfer rate = 50.96 M bytes/second 05-Dec 13:30 server-sd JobId 275: Error: bsock.c:485 Read error from client:192.168.1.123:36643: ERR=No data available 05-Dec 13:30 server-dir JobId 275: Bacula server-dir 2.5.16-beta (20Oct08): 05-Dec-2008 13:30:43 Build OS: i386-pc-solaris2.11 solaris 5.11 10-Oct 10:15 server-sd JobId 130: Fatal error: append.c:180 Malformed data header from FD: 10-Oct 10:15 server-sd JobId 130: Job write elapsed time = 00:03:34, Transfer rate = 13.56 M bytes/second\ 10-Oct 10:15 server-fd JobId 130: Fatal error: backup.c:892 Network send error to SD. ERR=Broken pipe\ 10-Oct 10:15 server-dir JobId 130: Error: Bacula server-dir 2.4.2 (26Jul08): 10-Oct-2008 10:15:49\ Build OS: i386-pc-solaris2.11 solaris 5.11 19-Sep 23:05 server-sd JobId 23: Ready to append to end of Volume "test1" size=140938726135 19-Sep 23:20 server-sd JobId 23: Fatal error: append.c:159 Error reading data header from FD. ERR=No data available 19-Sep 23:20 server-sd JobId 23: Job write elapsed time = 00:15:04, Transfer rate = 8.229 M bytes/second 19-Sep 23:20 server-sd JobId 23: Fatal error: bsock.c:415 Packet size too big from "client:127.0.0.1:36643. Terminating connection. 19-Sep 23:20 server-fd JobId 23: Fatal error: backup.c:892 Network send error to SD. ERR=Broken pipe 19-Sep 23:20 server-fd JobId 23: Error: bsock.c:306 Write error sending 32768 bytes to Storage daemon:127.0.0.1:9103: ERR=Broken pipe 19-Sep 23:20 server-dir JobId 23: Error: Bacula server-dir 2.4.2 (26Jul08): 19-Sep-2008 23:20:18 Build OS: i386-pc-solaris2.11 solaris 5.11 JobId: 23 Job: Client1.2008-09-19_23.05.49 Here's the truss of a failure (I think this is the relevant bit), using the options from Alan's message above: 14457/53: write(6, "\0\080\0 m k Y\r\n M 3 m".., 32772) = 32772 14457/53: read(14, " a p L S p L 9 f 6 + l 2".., 32768) = 32768 14457/53: write(6, "\0\080\0 a p L S p L 9 f".., 32772) = 32772 14457/53: read(14, " I T F a T F B i m 4 S V".., 32768) = 32768 14457/53: write(6, "\0\080\0 I T F a T F B i".., 32772) = 32772 14457/53: read(14, " C O / B f / / v\r\n 9 f".., 32768) = 32768 14457/53: write(6, "\0\080\0 C O / B f / / v".., 32772) = 32772 14457/54: pollsys(0xFE2DFDF0, 1, 0xFE2DFE98, 0x00000000) = 1 14457/54: fd=6 ev=POLLRDNORM rev=POLLRDNORM 14457/54: timeout: 5.000000000 sec 14457/53: read(14, " 6 d L r f 3 3\r\n 3 p q".., 32768) = 32768 14457/54: read(6, 0xFE2DFF7C, 4) Err#131 ECONNRESET 14457/54: lwp_sigmask(SIG_SETMASK, 0xFFBFFEFF, 0x0000FFF7) = 0xFFBFFEFF [0x0000FFFF] 14457/54: lwp_exit() 14457/53: write(6, "\0\080\0 6 d L r f 3 3\r".., 32772) Err#32 EPIPE 14457/53: Received signal #13, SIGPIPE [ignored] 14457/53: time() = 1229375022 14457/53: time() = 1229375022 14457/53: write(4, "\0\0\094 J m s g J o b".., 152) = 152 14457/53: close(14) = 0 14457/53: close(13) = 0 14457/53: close(12) = 0 14457/53: close(11) = 0 14457/53: close(10) = 0 14457/53: close(9) = 0 14457/53: close(8) = 0 14457/53: close(7) = 0 14457/53: nanosleep(0xFE3DEE0C, 0x00000000) = 0 14457/53: tmout: 0.050000000 sec 14457/53: nanosleep(0xFE3DEE0C, 0x00000000) = 0 14457/53: tmout: 0.050000000 sec 14457/53: nanosleep(0xFE3DEE0C, 0x00000000) = 0 14457/53: tmout: 0.050000000 sec 14457/53: nanosleep(0xFE3DEE0C, 0x00000000) = 0 14457/53: tmout: 0.050000000 sec -- Conor Edberg Letters & Science IT UC Santa Barbara ------------------------------------------------------------------------------ _______________________________________________ Bacula-users mailing list Bacula-users@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/bacula-users