Gustavo Noronha escreveu: > Hello, > > We have a somewhat big setup of Bacula here with lots of clients, some > of them with many differente jobs and filesets (such as the database > servers, which have their datafiles backed up separately and their > archive logs backed up from 30 to 30 minutes). > > Our network is quite unstable, and there's not much I can do about some > of the problems we have (because they are handled by third party > companies). Bacula runs mostly OK, but its SD also seems to be the cause > of some of the problems we have: it segfaults from time to time, and > sometimes it seems like it is stalled, and only starts working again > with a restart. > > Our main backup hardware is a Dell Poweredge T132, connected to a Dell > server via SCSI. We run Debian GNU/Linux sarge, with the backported > Bacula package (1.38.11-5~bpo.1). We upgraded to 1.38 because we were > having a lot of problems with the 1.36 version which seemed to be fixed > in 1.38, and, in fact, they have gone away. The tape drive is this one: > > (scsi1:A:1): 160.000MB/s transfers (80.000MHz DT, offset 31, 16bit) > Vendor: DELL Model: PV-132T Rev: 227D > Type: Medium Changer ANSI SCSI revision: 02 > Vendor: IBM Model: ULTRIUM-TD2 Rev: 37RH > Type: Sequential-Access ANSI SCSI revision: 03 > > The main problem we have is that some jobs sometimes fail because of the > network being unstable, and Bacula decides to mark the tape as failed. > This one, for instance: > > 23-Sep 23:52 lab02-sd: Writing spooled data to Volume. Despooling 197,378,916 > bytes ... > 23-Sep 23:58 lab02-sd: Writing spooled data to Volume. Despooling 658,402,880 > bytes ... > 24-Sep 00:06 rivelino-fd: RIVELINO.2007-09-22_00.01.28 Fatal error: > c:\cygwin\home\kern\bacula\k\src\win32\filed\../../filed/backup.c:500 Network > send error to SD. ERR=Input/output error > [...] > 24-Sep 00:07 lab02-dir: Rescheduled Job RIVELINO.2007-09-22_00.01.28 at > 24-Sep-2007 00:07 to re-run in 60 seconds (24-Sep-2007 00:08). > 24-Sep 00:06 lab02-dir: Job MDS513_TS_TBS_MALADIRETA_IDX.2007-09-24_00.06.24 > waiting 60 seconds for scheduled start time. > 24-Sep 00:07 lab02-dir: Start Backup JobId 6634, > Job=MDS513_TS_TBS_MALADIRETA_IDX.2007-09-24_00.06.24 > 24-Sep 00:07 lab02-sd: 3301 Issuing autochanger "loaded drive 0" command. > 24-Sep 00:07 lab02-sd: 3302 Autochanger "loaded drive 0", result is Slot 4. > 24-Sep 00:07 lab02-sd: 3301 Issuing autochanger "loaded drive 0" command. > 24-Sep 00:07 lab02-sd: 3302 Autochanger "loaded drive 0", result is Slot 4. > 24-Sep 00:07 lab02-sd: Volume "LTO_0004" previously written, moving to end of > data. > 24-Sep 00:08 lab02-sd: MDS513_TS_TBS_MALADIRETA_IDX.2007-09-24_00.06.24 > Error: I cannot write on Volume "LTO_0004" because: The number of files > mismatch! Volume=57 Catalog=56 > 24-Sep 00:08 lab02-sd: Marking Volume "LTO_0004" in Error in Catalog. > 24-Sep 00:09 lab02-dir: Recycled volume "LTO_0006" > > The SD also seems to be having trouble. On weekends we run full jobs for > almost all our clients. This sunday morning I went to check how the > backups were going to find all the jobs mostly stalled. I then checked > the SD, entered bconsole, tried a status storage, straced it, and it > seemed to be doing status information checks over and over again > (.status was appearing from time to time), not much was being done, it > seemed. I then stopped it, started it, and then the jobs went ahead. > This is the part of the log that matters: > > 21-Sep 22:03 mds520-fd: DIR and FD clocks differ by -3 seconds, FD > automatically adjusting. > 21-Sep 22:03 lab02-sd: Spooling data ... > 21-Sep 22:11 lab02-sd: Committing spooled data to Volume "LTO_0002". > Despooling 2,780,433,476 bytes ... > 21-Sep 22:14 lab02-sd: Sending spooled attrs to the Director. Despooling > 3,511 bytes ... > 23-Sep 10:28 mds520-fd: MDS520.2007-09-21_22.01.03 Fatal error: job.c:1614 > Comm error with SD. bad response to Append Data. ERR=N<C3><A3>o h<C3><A1> > dados dispon<C3><AD>veis (NOTE: data unavailable) > 23-Sep 10:28 lab02-dir: MDS520.2007-09-21_22.01.03 Error: Bacula 1.38.11 > (28Jun06): 23-Sep-2007 10:28:48 > JobId: 6299 > Job: MDS520.2007-09-21_22.01.03 > [...] > 23-Sep 10:28 lab02-dir: Start Backup JobId 6303, > Job=MDS513_TS_SYSAUX.2007-09-21_22.30.02 > 23-Sep 10:33 lab02-sd: Spooling data ... > 23-Sep 10:34 lab02-sd: Committing spooled data to Volume "LTO_0008". > Despooling 79,420,983 bytes ... > 23-Sep 10:34 lab02-sd: Sending spooled attrs to the Director. Despooling 293 > bytes ... > > On the SD log: > > 21-Sep 22:08 lab02-sd: Committing spooled data to Volume "LTO_0002". > Despooling 1,172,134,636 bytes ... > 21-Sep 22:11 lab02-sd: Committing spooled data to Volume "LTO_0002". > Despooling 2,780,433,476 bytes ... > 21-Sep 22:14 lab02-sd: Sending spooled attrs to the Director. Despooling 309 > bytes ... > 21-Sep 21:30 21-Sep 06:33 21-Sep 05:57 lab02-sd: Writing spooled data to > Volume. Despooling 5,014,740,654 bytes ... > 23-Sep 10:28 lab02-sd: 3301 Issuing autochanger "loaded drive 0" command. > 23-Sep 10:28 lab02-sd: 3302 Autochanger "loaded drive 0", result is Slot 2. > 23-Sep 10:28 lab02-sd: 3301 Issuing autochanger "loaded drive 0" command. > 23-Sep 10:28 lab02-sd: 3302 Autochanger "loaded drive 0", result is Slot 2. > 23-Sep 10:30 lab02-sd: 3301 Issuing autochanger "loaded drive 0" command. > 23-Sep 10:30 lab02-sd: 3302 Autochanger "loaded drive 0", result is Slot 2. > 23-Sep 10:30 lab02-sd: 3301 Issuing autochanger "loaded drive 0" command. > 23-Sep 10:30 lab02-sd: 3302 Autochanger "loaded drive 0", result is Slot 2. > 23-Sep 10:30 lab02-sd: Volume "LTO_0002" previously written, moving to end of > data. > 23-Sep 10:31 lab02-sd: MDS513_TS_SYSTEM.2007-09-21_23.01.01 Error: I cannot > write on Volume "LTO_0002" because: > The number of files mismatch! Volume=11 Catalog=10 > 23-Sep 10:31 lab02-sd: Marking Volume "LTO_0002" in Error in Catalog. > 23-Sep 10:32 lab02-sd: 3301 Issuing autochanger "loaded drive 0" command. > 23-Sep 10:32 lab02-sd: 3302 Autochanger "loaded drive 0", result is Slot 2. > 23-Sep 10:32 lab02-sd: 3307 Issuing autochanger "unload slot 2, drive 0" > command. > 23-Sep 10:32 lab02-sd: 3304 Issuing autochanger "load slot 8, drive 0" > command. > 23-Sep 10:33 lab02-sd: 3305 Autochanger "load slot 8, drive 0", status is OK. > 23-Sep 10:33 lab02-sd: 3301 Issuing autochanger "loaded drive 0" command. > 23-Sep 10:33 lab02-sd: 3302 Autochanger "loaded drive 0", result is Slot 8. > 23-Sep 10:33 lab02-sd: Recycled volume "LTO_0008" on device "LTO" > (/dev/nst0), all previous data lost. > 23-Sep 10:33 lab02-sd: Spooling data ... > 23-Sep 10:33 lab02-sd: Spooling data ... > 23-Sep 10:33 lab02-sd: Spooling data ... > > Here are full configuration and log files, I just changed the passwords > and tried to hide the file names on the error messages: > > http://mds510.mds.gov.br/~kov/bacula-info.tar.gz > > Anyone has any insight on how we could make Bacula be tougher when these > kinds of network errors happen? Perhaps by adopting the catalog or tape > as canonical and updating the other, instead of marking the tape as in > error? Also, any insight on what is happening to the SD? I am thinking > about limiting the number of concurrent jobs on the SD to 1 to see if it > helps, does that make sense to you? > >
Hi Gustavo. I'm from Brazil too :-) Did you think about create another network just for backups ? Today servers comming with two network cards, and you can use this network just for backup and avoid this problems you have. I also recommend you to upgrade to 2.2.4 version, because of serious bug #935: http://www.bacula.org/downloads/bug-935.txt -- Jeronimo Zucco LPIC-1 Linux Professional Institute Certified NĂșcleo de Processamento de Dados Universidade de Caxias do Sul http://jczucco.blogspot.com ------------------------------------------------------------------------- This SF.net email is sponsored by: Microsoft Defy all challenges. Microsoft(R) Visual Studio 2005. http://clk.atdmt.com/MRT/go/vse0120000070mrt/direct/01/ _______________________________________________ Bacula-users mailing list Bacula-users@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/bacula-users