Just when I thought I had it nailed down, my test machine failed 8 minutes and 4 GB into the transfer. Now that's a first. I did manage to complete multiple 20 GB backups of the Windows box that was failing consistently on earlier versions.
There is one network peculiarity of these machines: the host name resolves to multiple IP addresses. The machines that I am backing up are all servers on multiple VLANs. I changed from FQDNs to IPs and restarted all the daemons. No change -- same failure.

I traced all ports between devices and switches and checked the error rates. Absolutely clean connections.

The server adapter is a Realtek 8169. And though I have often been suspicious of Realtek, I guess it is one of the primary non-Intel chips. I'm not against trying another Ethernet adapter, but I don't have one on hand, so that test will have to wait.

Actually, there is one more dirty little secret. A couple of months ago, this server died. The network chip on the motherboard was fried (a paper tag on the top of it was visibly scorched) and the machine wouldn't even get to a BIOS screen. This was most likely due to lightning in the area. I was in a hurry to get the machine back into production, so I literally took a screwdriver and popped the old Ethernet controller chip off the motherboard. That solved my problem and the machine booted. I slipped in a gigabit network board, put the machine on a different UPS and a different switch, and 30 minutes later I was back in business. Incidentally, the switch was unharmed. Honestly, I couldn't make up something that good.

In spite of the machine's history, the Realtek adapter has nothing electrically in common with the disabled circuitry on the motherboard. If the old damage were causing spurious bus errors, I would expect them to bring the whole machine to a stop.

I guess I should change the subject of the posts for posterity's sake. I think I'm whittling away the options.

bbaker

>I found the documentation on the heartbeat, configured it for the FD and
>SD for 5 sec, restarted the daemons, and ran the test again. On the
>primary test machine, the backup is still dying in the same place. I
>did notice (a little late) that I was probably focusing on the wrong
>message.
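For reference, a heartbeat setup like the one described above might look roughly like the fragment below. It is only a sketch: the resource names are borrowed from the log output quoted further down and are illustrative, the other directives in each resource are left out, and the exact contents of the real configuration files are not shown in this thread.

```
# bacula-fd.conf on the client (resource name is illustrative)
FileDaemon {
  Name = mcleod-fd
  # ... existing directives unchanged ...
  # interval in seconds
  Heartbeat Interval = 5
}

# bacula-sd.conf on the server (resource name is illustrative)
Storage {
  Name = scott2-sd
  # ... existing directives unchanged ...
  # interval in seconds
  Heartbeat Interval = 5
}
```

Both daemons need a restart before a changed interval takes effect.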
>
>The clients and server are separated by a couple of switches, but they
>are on the same subnets, so routers should not be an issue. Most
>devices are gigabit on managed switches. Some devices are 100 Mbit. In
>particular, the server is gigabit and the primary test client is 100 Mbit.
>I plan to trace the route and check the errors on the ports -- starting
>with the server.
>
>For my primary test machine, the point of failure is consistently around
>5 mins into the backup with 2.460 to 2.464 G transferred.
>
>bbaker
>
>>On Friday 15 September 2006 18:07, William Baker wrote:
>>>(Thanks for kindly pointing me in the right direction, Kern.)
>>>
>>>I have a little bit more info to add to the mix -- and a little more
>>>confusion. Unix clients are behaving the same way. So, the only thing
>>>all these items appear to have in common is the server -- though it
>>>would seem strange to me to have such a problem in a production server
>>>that has been in use in other places for months.
>>>
>>>So, I upgraded the server to the latest beta. Surprise: same thing
>>>still happened -- "packet size too big". Well. The server is Fedora
>>>Core 4 with up-to-date patches. gcc version 4.0.2. I also failed to
>>>mention the server is built from source due to a strict MySQL version
>>>4.1.10 requirement. The clients are RPMs and EXEs.
>>>
>>>I guess now is the time to dig into the code. At least I have a few
>>>verbose error messages to point the way.
>>>
>>The problem you are having doesn't appear to be packet size too big because
>>that was not the first error message, and is likely spurious due to the
>>disconnection.
>>
>>I suspect that you are seeing network problems -- either a bad switch, a bad
>>ethernet card, or simply Windows software that doesn't follow Internet rules
>>and times out the line during large transfers. The manual discusses several
>>reasons for this, including in some cases a Bacula workaround called
>>Heartbeat Interval.
>>
>>>bbaker
>>>
>>>>You will probably have better luck getting your question answered on the
>>>>bacula-users list, which I have copied for you.
>>>>
>>>>On Friday 15 September 2006 15:36, William Baker wrote:
>>>>>I know "packet too long" is in the FAQ. I think this is a new but
>>>>>related issue. The error is consistent and repeatable.
>>>>>
>>>>>The server is a production version bacula 1.38.11 running on Linux with
>>>>>MySQL database. Two versions of the Windows client have been tested:
>>>>>1.38.10 and 1.39.22. Several configurations of the client have been
>>>>>tested, both with and without VSS enabled. I have a TODO list that
>>>>>includes backing up other (non-Windows) clients, but those tests haven't
>>>>>been done yet. The traces included below are for 1.39.22.
>>>>>
>>>>>The client data to back up is approximately 21 GB. For v1.38.10, only
>>>>>about 2 GB were actually backed up. For 1.39.22, about 20 GB were
>>>>>retrieved from the client before the following messages appeared:
>>>>>
>>>>>15-Sep 07:36 scott2-sd: mcleod-job.2006-09-15_07.17.39 Fatal error:
>>>>>append.c:144 Error reading data header from FD. ERR=No data available
>>>>>15-Sep 07:36 scott2-sd: mcleod-job.2006-09-15_07.17.39 Fatal error:
>>>>>bnet.c:228 Packet size too big from "client:192.168.4.20:36643.
>>>>>Terminating connection.
>>>>>15-Sep 07:36 mcleod-fd: mcleod-job.2006-09-15_07.17.39 Fatal error:
>>>>>../../filed/backup.c:787 Network send error to SD. ERR=Input/output error
>>>>>15-Sep 07:36 mcleod-fd: mcleod-job.2006-09-15_07.17.39 Error:
>>>>>../../lib/bnet.c:393 Write error sending len to Storage
>>>>>daemon:proe.priefert.com:9103: ERR=Input/output error
>>>>>15-Sep 07:38 mcleod-fd: VSS Writer (BackupComplete): "System Writer",
>>>>>State: 0x1 (VSS_WS_STABLE)
>>>>>15-Sep 07:38 mcleod-fd: VSS Writer (BackupComplete): "MSDEWriter",
>>>>>State: 0x1 (VSS_WS_STABLE)
>>>>>15-Sep 07:38 mcleod-fd: VSS Writer (BackupComplete): "IIS Metabase
>>>>>Writer", State: 0x1 (VSS_WS_STABLE)
>>>>>15-Sep 07:38 mcleod-fd: VSS Writer (BackupComplete): "Removable Storage
>>>>>Manager", State: 0x1 (VSS_WS_STABLE)
>>>>>15-Sep 07:38 mcleod-fd: VSS Writer (BackupComplete): "WMI Writer",
>>>>>State: 0x1 (VSS_WS_STABLE)
>>>>>15-Sep 07:38 mcleod-fd: VSS Writer (BackupComplete): "Event Log Writer",
>>>>>State: 0x1 (VSS_WS_STABLE)
>>>>>15-Sep 07:38 mcleod-fd: VSS Writer (BackupComplete): "Registry Writer",
>>>>>State: 0x1 (VSS_WS_STABLE)
>>>>>15-Sep 07:38 mcleod-fd: VSS Writer (BackupComplete): "COM+ REGDB
>>>>>Writer", State: 0x1 (VSS_WS_STABLE)
>>>>>15-Sep 07:38 scott2-dir: mcleod-job.2006-09-15_07.17.39 Error: Bacula
>>>>>1.38.11 (28Jun06): 15-Sep-2006 07:38:08
>>>>>
>>>>>On the client, the last few lines of the bacula.trace file tell a
>>>>>similar story:
>>>>>
>>>>>mcleod-fd: ../compat/compat.cpp:150 Leave cvt_u_to_win32_path
>>>>>path=\\?\GLOBALROOT\Device\HarddiskVolumeShadowCopy10\Program Files\ALK
>>>>>Technologies\PMW190\Connect\PCMSRV.HLP
>>>>>mcleod-fd: ../compat/compat.cpp:90 Enter convert_unix_to_win32_path
>>>>>mcleod-fd: ../compat/compat.cpp:141 path=D:\Program Files\ALK
>>>>>Technologies\PMW190\Connect\PCMSRV.HLP
>>>>>mcleod-fd: ../compat/compat.cpp:150 Leave cvt_u_to_win32_path
>>>>>path=\\?\GLOBALROOT\Device\HarddiskVolumeShadowCopy10\Program Files\ALK
>>>>>Technologies\PMW190\Connect\PCMSRV.HLP
>>>>>mcleod-fd: ../compat/compat.cpp:1107 readdir_r(b64960, {
>>>>>d_name="pcmsrv.pdf", d_reclen=10, d_off=66
>>>>>mcleod-fd: ../compat/compat.cpp:177 Enter wchar_win32_path
>>>>>mcleod-fd: ../compat/compat.cpp:351 Leave wchar_win32_path=\
>>>>>mcleod-fd: ../compat/compat.cpp:90 Enter convert_unix_to_win32_path
>>>>>mcleod-fd: ../compat/compat.cpp:141 path=D:\Program Files\ALK
>>>>>Technologies\PMW190\Connect\pcmsrv.pdf
>>>>>mcleod-fd: ../compat/compat.cpp:150 Leave cvt_u_to_win32_path
>>>>>path=\\?\GLOBALROOT\Device\HarddiskVolumeShadowCopy10\Program Files\ALK
>>>>>Technologies\PMW190\Connect\pcmsrv.pdf
>>>>>mcleod-fd: ../compat/compat.cpp:90 Enter convert_unix_to_win32_path
>>>>>mcleod-fd: ../compat/compat.cpp:141 path=D:\Program Files\ALK
>>>>>Technologies\PMW190\Connect\pcmsrv.pdf
>>>>>mcleod-fd: ../compat/compat.cpp:150 Leave cvt_u_to_win32_path
>>>>>path=\\?\GLOBALROOT\Device\HarddiskVolumeShadowCopy10\Program Files\ALK
>>>>>Technologies\PMW190\Connect\pcmsrv.pdf
>>>>>mcleod-fd: ../../filed/heartbeat.c:77 Got BNET_SIG 0 from SD
>>>>>mcleod-fd: ../../filed/heartbeat.c:82 wait_intr=1 stop=1
>>>>>mcleod-fd: ../../filed/backup.c:184 end blast_data ok=0
>>>>>mcleod-fd: ../../filed/job.c:221 Quit command loop. Canceled=1
>>>>>mcleod-fd: ../../filed/job.c:303 Calling term_find_files
>>>>>mcleod-fd: ../../filed/job.c:306 Done with term_find_files
>>>>>mcleod-fd: ../../filed/job.c:308 Done with free_jcr
>>>>>
>>>>>Actually, on the Windows box, I'm trying to back up most of C: and D:.
>>>>>Here is what cygwin df says about the data:
>>>>>
>>>>>C:\WINDOWS\system32>df
>>>>>Filesystem     1K-blocks     Used Available Use% Mounted on
>>>>>C:\cygwin\bin   20482843  9327624  11155219  46% /usr/bin
>>>>>C:\cygwin\lib   20482843  9327624  11155219  46% /usr/lib
>>>>>C:\cygwin       20482843  9327624  11155219  46% /
>>>>>c:              20482843  9327624  11155219  46% /cygdrive/c
>>>>>d:             123170320 12428172 110742148  11% /cygdrive/d
>>>>>
>>>>>While the backup statistics give the following:
>>>>>
>>>>>JobId:                  8
>>>>>Job:                    mcleod-job.2006-09-15_07.17.39
>>>>>Backup Level:           Full
>>>>>Client:                 "mcleod-fd" Linux,Cross-compile,Win32
>>>>>FileSet:                "BasicWindowsFileSet" 2006-09-14 21:26:55
>>>>>Pool:                   "Default"
>>>>>Storage:                "LTO2"
>>>>>Scheduled time:         15-Sep-2006 07:17:36
>>>>>Start time:             15-Sep-2006 07:17:44
>>>>>End time:               15-Sep-2006 07:38:08
>>>>>Elapsed time:           20 mins 24 secs
>>>>>Priority:               1
>>>>>FD Files Written:       73,064
>>>>>SD Files Written:       72,744
>>>>>FD Bytes Written:       20,028,562,878 (20.02 GB)
>>>>>SD Bytes Written:       20,037,565,140 (20.03 GB)
>>>>>Rate:                   16363.2 KB/s
>>>>>Software Compression:   None
>>>>>Volume name(s):         bacula-1
>>>>>Volume Session Id:      7
>>>>>
>>>>>So, the vast majority of the data was sent. I don't yet know enough
>>>>>about bacula to know if the difference between the FD Files Written and
>>>>>SD Files Written is any kind of clue.
>>>>>
>>>>>By the way, the beta on Windows looks very promising. I liked what I
>>>>>saw. I worked a little with BartPE to build a bootable recovery CD. I
>>>>>know the issues associated with that. Can you post off-topic in your
>>>>>own post?
>>>>>
>>>>>bbaker
>>>>>
>>>>>-------------------------------------------------------------------------
>>>>>Using Tomcat but need to do more? Need to support web services, security?
>>>>>Get stuff done quickly with pre-integrated technology to make your job easier
>>>>>Download IBM WebSphere Application Server v.1.0.1 based on Apache Geronimo
>>>>>http://sel.as-us.falkag.net/sel?cmd=lnk&kid=120709&bid=263057&dat=121642
>>>>>_______________________________________________
>>>>>Bacula-devel mailing list
>>>>>[EMAIL PROTECTED]
>>>>>https://lists.sourceforge.net/lists/listinfo/bacula-devel

_______________________________________________
Bacula-users mailing list
Bacula-users@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/bacula-users