Re: [Bacula-users] Job Terminated at the director but not the client

Dan Langille Tue, 08 Jan 2008 13:25:02 -0800

Ken Monville wrote:
> 
> On Jan 8, 2008, at 2:36 PM, Dan Langille wrote:
> 
>>
>> On Jan 8, 2008, at 2:30 PM, Ken Monville wrote:
>>> On Jan 8, 2008, at 2:12 PM, Dan Langille wrote:
>>>
>>>> Ken Monville wrote:
>>>>> Hi all,
>>>>> I had a longtime running bacula 1.38.11 environment running on
>>>>> FreeBSD  hosts that worked flawlessly.  Recently I had a drive
>>>>> failure on one  of the clients, rebuilt it from scratch and have
>>>>> been unable to get a  successful backup since.
>>>>> Ironically, I started having issues with the server and decided it
>>>>> was  time for a fresh reinstall of bacula, including an upgrade to
>>>>> 2.2.7 on  all servers and clients.  This was a fresh install, new
>>>>> database and  everything.
>>>>> The client in question has 3 separate backup jobs assigned to it.
>>>>> 2  of 3 work flawlessly.  The third, also the largest, appears to
>>>>> run  successfully until the very end when a 'status director' spits
>>>>> out the  following:
>>>>> *st dir
>>>>> ducati-dir Version: 2.2.7 (24 December 2007) i386-unknown-
>>>>> freebsd6.2  freebsd 6.2-RELEASE-p8
>>>>> Daemon started 03-Jan-08 15:45, 10 Jobs run since started.
>>>>> Heap: heap=1,126,400 smbytes=211,929 max_bytes=212,400 bufs=1,013
>>>>> max_bufs=1,020
>>>>> Scheduled Jobs:
>>>>> Level          Type     Pri  Scheduled          Name
>>>>> Volume
>>>>> = = = = = = = = = = =
>>>>> =
>>>>> =
>>>>> =
>>>>> =
>>>>> =
>>>>> ===================================================================
>>>>> Incremental    Backup    10  09-Jan-08 01:05    Ducati Root
>>>>> *unknown*
>>>>> Incremental    Backup    10  09-Jan-08 01:05    Benelli DB
>>>>> *unknown*
>>>>> Incremental    Backup    10  09-Jan-08 01:05    Benelli Root
>>>>> *unknown*
>>>>> Incremental    Backup    10  09-Jan-08 01:05    Astoria Web
>>>>> *unknown*
>>>>> Incremental    Backup    10  09-Jan-08 01:05    Astoria IMAP
>>>>> *unknown*
>>>>> Incremental    Backup    10  09-Jan-08 01:05    Astoria Root
>>>>> *unknown*
>>>>> Full           Backup    11  09-Jan-08 01:15    BackupCatalog
>>>>> *unknown*
>>>>> ====
>>>>> Running Jobs:
>>>>> JobId Level   Name                       Status
>>>>> =
>>>>> =
>>>>> =
>>>>> ===================================================================
>>>>>    43 Full    BackupCatalog.2008-01-04_01.15.09 is waiting for
>>>>> higher priority jobs to finish
>>>>>    47 Full    Astoria_Web.2008-01-05_01.05.13 has terminated
>>>>>    ...
>>>>> ====
>>>>> Terminated Jobs:
>>>>> JobId  Level    Files      Bytes   Status   Finished        Name
>>>>> =
>>>>> ===================================================================
>>>>>    37  Incr      1,204    223.7 M  OK       04-Jan-08 01:06
>>>>> Ducati_Root
>>>>>    38  Incr         12    4.120 M  OK       04-Jan-08 01:07
>>>>> Astoria_Root
>>>>>    41  Incr          6    13.30 K  OK       04-Jan-08 01:07
>>>>> Benelli_Root
>>>>>    39  Incr        250    47.12 M  OK       04-Jan-08 01:09
>>>>> Astoria_IMAP
>>>>>    42  Incr          0         0   OK       04-Jan-08 01:09
>>>>> Benelli_DB
>>>>>    40  Full     44,265    3.425 G  Cancel   06-Jan-08 11:22
>>>>> Astoria_Web
>>>>>    44  Incr      1,110    203.9 M  OK       06-Jan-08 11:24
>>>>> Ducati_Root
>>>>>    45  Incr         71    4.791 M  OK       06-Jan-08 11:25
>>>>> Astoria_Root
>>>>>    46  Incr        182    45.71 M  OK       06-Jan-08 11:26
>>>>> Astoria_IMAP
>>>>>    48  Incr          6    13.53 K  OK       06-Jan-08 11:26
>>>>> Benelli_Root
>>>>> ====
>>>>> Please notice under the Running Jobs: section, the "Astoria_Web"
>>>>> Job  is indicating "terminated."
>>>>> Now, querying the status of the client itself:
>>>>> *st client=astoria-fd
>>>>> Connecting to Client astoria-fd at astoria.monville.net:9102
>>>>> astoria-fd Version: 2.2.7 (24 December 2007)  i386-unknown-
>>>>> freebsd6.2  freebsd 6.2-RELEASE-p9
>>>>> Daemon started 06-Jan-08 11:24, 2 Jobs run since started.
>>>>> Heap: heap=815,104 smbytes=145,440 max_bytes=484,217 bufs=102
>>>>> max_bufs=168
>>>>> Sizeof: boffset_t=8 size_t=4 debug=0 trace=0
>>>>> Running Jobs:
>>>>> JobId 47 Job Astoria_Web.2008-01-05_01.05.13 is running.
>>>>>    Backup Job started: 06-Jan-08 11:28
>>>>>    Files=44,934 Bytes=3,426,818,210 Bytes/sec=18,871 Errors=0
>>>>>    Files Examined=44,934
>>>>>    Processing file: /wwwroot
>>>>>    SDReadSeqNo=9 fd=5
>>>>> Director connected at: 08-Jan-08 13:54
>>>>> ====
>>>>> Terminated Jobs:
>>>>> JobId  Level    Files      Bytes   Status   Finished        Name
>>>>> =
>>>>> =
>>>>> =
>>>>> ===================================================================
>>>>>    45  Incr         71    4.791 M  OK       06-Jan-08 11:26
>>>>> Astoria_Root
>>>>>    46  Incr        182    45.71 M  OK       06-Jan-08 11:28
>>>>> Astoria_IMAP
>>>>> ====
>>>>> The client appears to believe that the job is still running, but
>>>>> always fails while processing file "/wwwroot" even though that is
>>>>> the  only filesystem in the FileSet. (And I believe everything in
>>>>> it has  already been backed up.)
>>>>
>>>> I suspect the job has finished backing up, and is now spooling file
>>>> attributes to the database.  Check your database server load to
>>>> verify.
>>>>
>>>> Are your jobs set up to spool file attributes?  If so, this occurs
>>>> at the end of the job.
>>>>
>>>>> The client lives outside my firewall and I am using TLS
>>>>> encryption,  although I have tested with it disabled.  I currently
>>>>> have the  "Heartbeat Interval" set to 60 seconds on both the client
>>>>> and the  storage daemon.
>>>>> The only way to "free" up the storage daemon so other jobs can run
>>>>> is  to restart the client fd, then cancel the job at the director,
>>>>> but  then the next time it runs it does another Full backup,
>>>>> upgraded from  Incr.
>>>>> I'm at my wits end as to why this is failing in this manner and am
>>>>> hoping someone can point me in the right direction.
>>>>
>>>> Perhaps nothing is wrong at all.  :)
>>
>>> Hi Dan,
>>>
>>> Double checking my configuration, there is no "Spool Attributes"
>>> parameter set anywhere, so I assume its off.  The database server has
>>> virtually no load and I don't notice anything when the job is hung.
>>> Also, it stays in this state indefinitely until I restart the fd as I
>>> mentioned in this first post...
>>>
>>> Thanks again,
>>> Ken
>>
>> Is spooling mentioned at all.  If so, spooling of file attributes will 
>> be done.
>>
> 
> Nope, no spooling is enabled.  I am backing up to disk, so I didn't 
> enable spooling.
> 
>> http://www.bacula.org/rel-manual/Data_Spooling.html
>>
>> Do you have the emailed job report from this job?  If so, please paste.
>>
> 
> 05-Jan 01:05 ducati-dir JobId 47: No prior Full backup Job record found.
> 05-Jan 01:05 ducati-dir JobId 47: No prior or suitable Full backup found 
> in catalog. Doing FULL backup.
> 06-Jan 11:27 ducati-dir JobId 47: Start Backup JobId 47, 
> Job=Astoria_Web.2008-01-05_01.05.13
> 06-Jan 11:27 ducati-dir JobId 47: Created new Volume "Full-0012" in 
> catalog.
> 06-Jan 11:27 ducati-dir JobId 47: Using Device "FileStorage"
> 06-Jan 11:28 astoria-fd JobId 47: ClientBeforeJob: run command 
> "/usr/local/etc/rc.d/mysql stop"
> 06-Jan 11:28 astoria-fd JobId 47: ClientBeforeJob:  mysqld
> 06-Jan 11:27 ducati-sd JobId 47: Labeled new Volume "Full-0012" on 
> device "FileStorage" (/bacula).
> 06-Jan 11:27 ducati-sd JobId 47: Wrote label to prelabeled Volume 
> "Full-0012" on device "FileStorage" (/bacula)
> 06-Jan 13:36 ducati-sd JobId 47: Job write elapsed time = 02:09:11, 
> Transfer rate = 442.9 K bytes/second
> 06-Jan 13:37 astoria-fd JobId 47: ClientAfterJob: run command 
> "/usr/local/etc/rc.d/mysql start"
> 08-Jan 14:09 ducati-dir JobId 47: Fatal error: Network error with FD 
> during Backup: ERR=Broken pipe


The job has been running for over two days.

> 08-Jan 14:09 ducati-dir JobId 47: Error: Bacula ducati-dir 2.2.7 
> (24Dec07): 08-Jan-2008 14:09:54
>  Build OS:               i386-unknown-freebsd6.2 freebsd 6.2-RELEASE-p8
>  JobId:                  47
>  Job:                    Astoria_Web.2008-01-05_01.05.13
>  Backup Level:           Full (upgraded from Incremental)
>  Client:                 "astoria-fd" 2.2.7 (24Dec07) 
> i386-unknown-freebsd6.2,freebsd,6.2-RELEASE-p9
>  FileSet:                "Astoria Web" 2008-01-01 10:39:25
>  Pool:                   "Full-Pool" (From Job FullPool override)
>  Storage:                "OutsideFile" (From Job resource)
>  Scheduled time:         05-Jan-2008 01:05:00
>  Start time:             06-Jan-2008 11:27:08
>  End time:               08-Jan-2008 14:09:54
>  Elapsed time:           2 days 2 hours 42 mins 46 secs
>  Priority:               10
>  FD Files Written:       44,934
>  SD Files Written:       44,934
>  FD Bytes Written:       3,426,818,210 (3.426 GB)
>  SD Bytes Written:       3,432,986,097 (3.432 GB)
>  Rate:                   18.8 KB/s

With very slow throughput.

>  Software Compression:   None
>  VSS:                    no
>  Encryption:             no
>  Volume name(s):         Full-0012
>  Volume Session Id:      11
>  Volume Session Time:    1199393132
>  Last Volume Bytes:      3,436,447,127 (3.436 GB)
>  Non-fatal FD errors:    0
>  SD Errors:              0
>  FD termination status:  Error
>  SD termination status:  OK
>  Termination:            *** Backup Error ***
> 
> When I saw the "Broken Pipe" error, I search the archives and came up 
> with the "Heartbeat Interval" setting, which has had no effect.

The original post led me to believe you thought the terminated job 
hanging around was a problem.  I did not know that errors had occurred 
during the job.  It's my fault for assuming what you were concerned 
about.  And of course, I didn't read the subject for the other vital 
piece of information.

Slowing down, and reading the "status client" output shows me that the 
client thinks the job is still running.

 From what I can tell, you have a very slow network connection and you 
area backing up rather large amounts of data.  During the backup, a 
network problem occurred.  This led to a disconnection and an error.

Is this throughput level expected by you?  18KB/s is slow.

Is this the throughput you achieved in previous successful runs of this job?

-- 
Dan Langille

BSDCan - The Technical BSD Conference : http://www.bsdcan.org/
PGCon  - The PostgreSQL Conference:     http://www.pgcon.org/

-------------------------------------------------------------------------
Check out the new SourceForge.net Marketplace.
It's the best place to buy or sell services for
just about anything Open Source.
http://ad.doubleclick.net/clk;164216239;13503038;w?http://sf.net/marketplace
_______________________________________________
Bacula-users mailing list
Bacula-users@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/bacula-users

Re: [Bacula-users] Job Terminated at the director but not the client

Reply via email to