You could try running bacula-fd with debugging output. Unfortunately, it doesn't include timestamps, but you can add them like this:

bacula-fd -d50,scheduler -f -v ...your normal bacula-fd args... | perl -pe 'use POSIX strftime; print strftime "[%Y-%m-%d %H:%M:%S] ", localtime'

That might show exactly when it gets the "Connection refused" errors and help to diagnose it.

__Martin


>>>>> On Sat, 23 Jul 2022 12:22:30 +0200, Justin Case said:
>
> The last two scheduled runs of the job again had the connection errors (and
> again the backup was still taken fine). Yesterday I even ran the longer
> running job a few hours ahead to see if this was the reason why the
> connection error disappeared the other night - but that was not it. Also
> commenting out the reconnect clause didn’t make a difference.
>
> The only two things I am aware I can do now to check it out further:
>
> (1) use a connection schedule for the FD
> (2) downgrade the FD from 13 to 11 (can this really be the cause?)
>
>
> > On 20. Jul 2022, at 22:35, Justin Case <jus7inc...@gmail.com> wrote:
> >
> >
> > Hey Bill, thanks for spending time on this!
> >
> >> On 20. Jul 2022, at 21:46, Bill Arlofski via Bacula-users
> >> <bacula-users@lists.sourceforge.net> wrote:
> >>
> >>
> >> Justin,
> >>
> >> I know what you told us, but I think we have a situation that I (and
> >> Martin) described:
> >
> > I understand your experiment, but it is not like that here.
> >
> >> - FD cannot connect to Director due to firewall
> >
> > It can.
> >
> >> - Director CAN connect to FD (I know, I know... :)
> >
> > It cannot.
> >
> >> - Job starts, Director connects to FD and receives all the queued "Cannot
> >> connect" messages
> >> - Job runs fine
> >>
> >>
> >> Here is how I tested:
> >>
> >> - In my FD config I set in the Director{} block:
> >>
> >>   - ConnectToDirector = yes
> >>   - A BOGUS IP address for the `Address =` setting
> >>
> >>
> >> I killed and restarted the FD in foreground and debug mode, and I see that
> >> it goes on to try to connect to an IP address that
> >> is not taken on my network....
> >> ----8<----
> >> speedy-fd: events.c:48-0 Events: code=FD0001 daemon=speedy-fd ref=0x238e
> >> type=daemon source=*Daemon* text=Filed startup
> >> 13.0.0 (04Jul22)
> >> speedy-fd: filed.c:296-0 filed: listening on port 9102
> >> speedy-fd: bnet_server.c:90-0 Addresses 0.0.0.0:9102
> >> speedy-fd: bsockcore.c:354-0 Current 10.1.1.99:9101 All 10.1.1.99:9101
> >> speedy-fd: bsockcore.c:443-0 Could not connect to server Director daemon
> >> 10.1.1.99:9101. ERR=No route to host
> >> speedy-fd: bsockcore.c:253-0 Unable to connect to Director daemon on
> >> 10.1.1.99:9101. ERR=No route to host
> >> speedy-fd: bsockcore.c:354-0 Current 10.1.1.99:9101 All 10.1.1.99:9101
> >> speedy-fd: bsockcore.c:443-0 Could not connect to server Director daemon
> >> 10.1.1.99:9101. ERR=No route to host
> >> speedy-fd: bsockcore.c:253-0 Unable to connect to Director daemon on
> >> 10.1.1.99:9101. ERR=No route to host
> >> ----8<----
> >>
> >> Meanwhile, from the Director, I do a `status client=xxxx` and BAMM..
> >> Director connects to Client and I get the FD's status -
> >> so a Job would also work in this manner.
> >>
> >>
> >> From your Director, can you try:
> >
> > Good thinking. This was the first thing I checked when I saw the errors,
> > though. I usually try everything I can think of before I turn to the
> > mailing list, but of course you cannot know what I tried, as I did not
> > mention it.
> >
> >> # telnet <IP of Client> 9102
> >
> > Connection refused
> >
> >> And from the Client:
> >>
> >> # telnet <IP of Director> 9101
> >
> > No telnet there; using netcat instead, the connection gets established. I
> > can write stuff, and after some “invalid keywords” the connection is closed
> > by the director.
> >
> > To be sure I tried again with other port numbers that have no daemons
> > running. netcat returns immediately (due to the port being closed).
> >
> >> And show us the results?
> >
> > See above.
> >
> > In the meantime the schedule ran again.
> >
> > Today: no connection error messages.
> >
> > OK OK, but why? What was different? I did some experiments earlier, so the
> > job did run twice before.
> >
> > Also I ran another longer running job on the other tier, but actually the
> > problematic job did not queue up but ran through immediately (so both jobs
> > ran simultaneously) and no errors were thrown.
> >
> > When the schedule started a few minutes ago, the longer running job was
> > also started as “incremental” and had no files to be backed up (because it
> > ran a few hours before and no changes had been made in the fileset).
> >
> > Finally, I had commented out the Reconnect clause.
> >
> > Hard to say what the reason was.
> >
> > I will observe whether tomorrow it will or will not throw connection
> > errors and will report back in both cases. And I will not make any
> > experiments on Bacula before the schedule runs tomorrow.
> >
> >>
> >> Thank you!
> >> Bill
> >>
> >> --
> >> Bill Arlofski
> >> w...@protonmail.com
> >> _______________________________________________
> >> Bacula-users mailing list
> >> Bacula-users@lists.sourceforge.net
> >> https://lists.sourceforge.net/lists/listinfo/bacula-users
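
P.S. In case it helps to repeat the two telnet/netcat checks from the quoted thread in a consistent way, here is a minimal sketch in Python. It is only an illustration, not part of Bacula: the `probe` function and the command-line format are made up for this example, and the ports are simply the ones mentioned in the thread (9102 for the FD, 9101 for the Director). It tries a single TCP connect and reports whether the port was open, refused, or timed out.

```python
import socket
import sys


def probe(host: str, port: int, timeout: float = 3.0) -> str:
    """Try one TCP connect and classify the outcome (a scripted telnet test)."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        # The host answered, but nothing accepted the connection
        # (no listener, or a firewall that actively rejects).
        return "refused"
    except socket.timeout:
        # No answer within the timeout; packets may be silently dropped.
        return "timeout"
    except OSError as exc:
        # Anything else, e.g. "No route to host" or a name lookup failure.
        return f"error: {exc.strerror or exc}"


if __name__ == "__main__":
    # Hypothetical usage: python probe.py <IP of Client>:9102 <IP of Director>:9101
    for target in sys.argv[1:]:
        host, _, port = target.rpartition(":")
        print(f"{target}: {probe(host, int(port))}")
```

Run it on the Director host against the Client's port 9102 and on the Client against the Director's port 9101. Roughly speaking, "refused" usually means the machine was reached but nothing accepted the connection, while "timeout" more often points at a firewall dropping packets, though that interpretation depends on the network setup.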