You could try running bacula-fd with debugging output.  Unfortunately, it
doesn't include timestamps, but you can do it like this:

bacula-fd -d50,scheduler -f -v ...your normal bacula-fd args... | \
  perl -pe 'use POSIX qw(strftime); print strftime "[%Y-%m-%d %H:%M:%S] ", localtime'

That might show exactly when it gets the "Connection refused" errors and help
to diagnose it.
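
If Perl is not available on the client, the same timestamping filter can be sketched in Python. This is only an illustration of the idea, not something that ships with Bacula: the names stamp and stamp_stream and the --stdin flag are made up for this sketch.

```python
import sys
import time


def stamp(line, now=None):
    """Prefix one line with a local-time stamp such as [2022-07-23 12:22:30]."""
    # time.localtime(None) means "the current time".
    prefix = time.strftime("[%Y-%m-%d %H:%M:%S] ", time.localtime(now))
    return prefix + line


def stamp_stream(src, dst):
    """Copy src to dst line by line, stamping each line as it arrives."""
    for line in src:
        dst.write(stamp(line))
        # Flush per line so the stamps reflect when the message showed up.
        dst.flush()


# Hypothetical flag for this sketch: only read stdin when asked to.
if __name__ == "__main__" and "--stdin" in sys.argv:
    stamp_stream(sys.stdin, sys.stdout)
```

Saved as, say, stamp.py, it would be used in place of the perl command: pipe the bacula-fd output into "python3 stamp.py --stdin". The timestamp is taken when the filter reads the line, so it can lag slightly behind the moment the daemon wrote it.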

__Martin


>>>>> On Sat, 23 Jul 2022 12:22:30 +0200, Justin Case said:
> 
> The last two scheduled runs of the job again had the connection errors (and 
> again the backup was still taken fine). Yesterday I even ran the longer 
> running job a few hours ahead to see if this was the reason why the 
> connection error disappeared the other night - but that was not it. Also, 
> commenting out the reconnect clause didn’t make a difference.
> 
> The only two things I am aware of that I can do now to investigate further:
> 
> (1) use a connection schedule for the FD
> (2) downgrade the FD from 13 to 11 (can this really be the cause?)
> 
> > On 20. Jul 2022, at 22:35, Justin Case <jus7inc...@gmail.com> wrote:
> > 
> > 
> > Hey Bill, thanks for spending time on this!
> > 
> >> On 20. Jul 2022, at 21:46, Bill Arlofski via Bacula-users 
> >> <bacula-users@lists.sourceforge.net 
> >> <mailto:bacula-users@lists.sourceforge.net>> wrote:
> >> 
> >> 
> >> Justin,
> >> 
> >> I know what you told us, but I think we have a situation that I (and 
> >> Martin) described:
> > 
> > I understand your experiment, but it is not like that here.
> > 
> >> - FD cannot connect to Director due to firewall
> > 
> > It can.
> > 
> >> - Director CAN connect to FD (I know, I know... :)
> > 
> > It cannot.
> > 
> >> - Job starts, Director connects to FD and receives all the queued "Cannot 
> >> connect" messages
> >> - Job runs fine
> >> 
> >> 
> >> Here is how I tested:
> >> 
> >> - In my FD config I set in the Director{} block:
> >> 
> >>  - ConnectToDirector = yes
> >>  - A BOGUS IP address for the `Address =` setting
> >> 
> >> 
> >> I killed and restarted the FD in foreground and debug mode, and I see that 
> >> it goes on to try to connect to an IP address that
> >> is not taken on my network....
> >> ----8<----
> >> speedy-fd: events.c:48-0 Events: code=FD0001 daemon=speedy-fd ref=0x238e 
> >> type=daemon source=*Daemon* text=Filed startup
> >> 13.0.0 (04Jul22)
> >> speedy-fd: filed.c:296-0 filed: listening on port 9102
> >> speedy-fd: bnet_server.c:90-0 Addresses 0.0.0.0:9102
> >> speedy-fd: bsockcore.c:354-0 Current 10.1.1.99:9101 All 10.1.1.99:9101
> >> speedy-fd: bsockcore.c:443-0 Could not connect to server Director daemon 
> >> 10.1.1.99:9101. ERR=No route to host
> >> speedy-fd: bsockcore.c:253-0 Unable to connect to Director daemon on 
> >> 10.1.1.99:9101. ERR=No route to host
> >> speedy-fd: bsockcore.c:354-0 Current 10.1.1.99:9101 All 10.1.1.99:9101
> >> speedy-fd: bsockcore.c:443-0 Could not connect to server Director daemon 
> >> 10.1.1.99:9101. ERR=No route to host
> >> speedy-fd: bsockcore.c:253-0 Unable to connect to Director daemon on 
> >> 10.1.1.99:9101. ERR=No route to host
> >> ----8<----
> >> 
> >> Meanwhile, from the Director, I do a `status client=xxxx` and BAMM.. 
> >> Director connects to Client and I get the FD's status -
> >> so a Job would also work in this manner.
> >> 
> >> 
> >> From your Director, can you try:
> > 
> > Good thinking. This was the first thing I checked when I saw the errors, 
> > though. I usually try everything I can think of before I turn to the 
> > mailing list, but of course you cannot know what I tried, as I did not 
> > mention it.
> > 
> >> # telnet <IP of Client> 9102
> > 
> > Connection refused
> > 
> >> And from the Client:
> >> 
> >> # telnet <IP of Director> 9101
> > 
> > No telnet there, so I used netcat instead; the connection gets established. I 
> > can write stuff, and after some “invalid keywords” the connection is closed by 
> > the director.
> > 
> > To be sure, I tried again with other port numbers that have no daemons 
> > running. netcat returns immediately (due to the port being closed).
> > 
> >> And show us the results?
> > 
> > see above.
> > 
> > In the meantime the schedule ran again.
> > 
> > Today: no connection error messages.
> > 
> > OK OK, but why? What was different? I did some experiments earlier, so the 
> > job did run twice before.
> > 
> > Also I ran another longer running job on the other tier, but actually the 
> > problematic job did not queue up but ran through immediately (so both jobs 
> > ran simultaneously) and no errors were thrown.
> > 
> > When the schedule started a few minutes ago, the longer running job was also 
> > started as “incremental” and had no files to back up (because it had run a 
> > few hours before and no changes had been made in the fileset).
> > 
> > Finally, I had commented out the Reconnect clause.
> > 
> > Hard to say what was the reason. 
> > 
> > I will observe whether or not it throws connection errors tomorrow and 
> > will report back in both cases. And I will not make any 
> > experiments on Bacula before the schedule runs tomorrow.
> > 
> >> 
> >> Thank you!
> >> Bill
> >> 
> >> --
> >> Bill Arlofski
> >> w...@protonmail.com <mailto:w...@protonmail.com>
> >> _______________________________________________
> >> Bacula-users mailing list
> >> Bacula-users@lists.sourceforge.net 
> >> <mailto:Bacula-users@lists.sourceforge.net>
> >> https://lists.sourceforge.net/lists/listinfo/bacula-users 
> >> <https://lists.sourceforge.net/lists/listinfo/bacula-users>
> 


