Hi, So having quite a frustrating issue with bacula on one of our CentOS hosts. Without pattern, I will receive this error on some days:
12-Oct 09:13 dir JobId 24468: Start Backup JobId 24468, Job=<job> 12-Oct 09:13 dir JobId 24468: Using Device "chg0_drive0" 12-Oct 09:46 fd JobId 24468: Fatal error: Authorization key rejected by Storage daemon. Please see http://www.bacula.org/en/rel-manual/Bacula_Freque_Asked_Questi.html#SECTION003760000000000000000 for help. 12-Oct 09:46 fd JobId 24468: Fatal error: Failed to authenticate Storage daemon. 12-Oct 09:46 dir JobId 24468: Fatal error: Bad response to Storage command: wanted 2000 OK storage , got 2902 Bad storage If the backup actually DOES work, it takes forever for the job to complete, even for very little data (connected to 1gb switch). Brief Network Setup: We utilize a lot of VLANs on our network so our bacula server is accessible by a set IP on each VLAN (ex. VLAN id= 25, bacula server listens on x.x.25.100 on that VLAN). All our servers run on XenServer hosts and we've never had an issue with bacula running until this particular host (1.5+ years). Good Host: Below is snippets of debug output from a working host with 0 issues- fd: job.c:263-0 <dird: Hello Director dir calling fd: job.c:279-0 Executing Hello command. fd: cram-md5.c:73-0 send: auth cram-md5 <1146528753.1350051814@fd> ssl=0 fd: cram-md5.c:133-0 cram-get received: auth cram-md5 <1771224273.1350052057@dir> ssl=0 fd: cram-md5.c:152-0 sending resp to challenge: u8/6LAM+5A+Jc4/nzR/NiC fd: job.c:263-0 <dird: JobId=24472 Job=<job name> SDid=1197 SDtime=1347377831 Authorization=<authorization key> >From there, it goes into- fd: job.c:263-0 <dird: storage address=<FQDN of bacula director server> port=9103 ssl=0 fd: job.c:279-0 Executing storage command. fd: job.c:1474-0 StorageCmd: storage address=<FQDN of bacula server>port=9103 ssl=0 fd: bsock.c:220-0 Current host[ipv4:10.10.x.100:9103] All host[ipv4:10.10.x.100:9103] fd: bsock.c:154-0 who=Storage daemon host=<FQDN of bacula director server> port=9103 fd: cram-md5.c:133-0 cram-get received: auth cram-md5 <1575779100.1350052057@sd> ssl=0 fd: cram-md5.c:152-0 sending resp to challenge: B9/fK0xcX6E+K4+oZl/KPB fd: cram-md5.c:80-0 send: auth cram-md5 <786312287.1350051814@ovpn1-fd> ssl=0 fd: cram-md5.c:99-0 Authenticate OK A9/4P+/TBjg+N8dRc4/aPD fd: job.c:263-0 <dird: backup And the job completes in adequate time. Problem Host: The output on the problem host is completely different. First, for connecting, it looks like this: clcd-fd: job.c:263-0 <dird: storage address=<FQDN of bacula server> port=9103 ssl=0 clcd-fd: job.c:279-0 Executing storage command. clcd-fd: job.c:1474-0 StorageCmd: storage address=<FQDN of bacula server> port=9103 ssl=0 clcd-fd: bsock.c:220-0 Current host[ipv4:10.10.5x.100:9103] All host[ipv4:10.10.5x.100:9103] host[ipv4:10.10.5x.100:65535] host[ipv4:10.10.7x.100:65535] host[ipv4:10.10.x.x:65535] host[ipv4:10.10.x.100:65535] host[ipv4:10.10.x.10:65535] host[ipv4:10.10.1x.100:65535] host[ipv4:10.10.2x.100:65535] host[ipv4:10.10.2x.100:65535] host[ipv4:10.10.5x.100:65535] host[ipv4:10.10.5x.100:65535] host[ipv4:10.10.5x.100:65535] host[ipv4:10.10.5x.100:65535] clcd-fd: bsock.c:220-0 Current host[ipv4:10.10.5x.100:9103] All host[ipv4:10.10.5x.100:9103] host[ipv4:10.10.5x.100:9103] host[ipv4:10.10.7x.100:65535] host[ipv4:10.10.x.x:65535] host[ipv4:10.10.x.100:65535] host[ipv4:10.10.x.10065535] host[ipv4:10.10.1x.100:65535] host[ipv4:10.10.2x.100:65535] host[ipv4:10.10.2x.100:65535] host[ipv4:10.10.5x.100:65535] host[ipv4:10.10.5x.100:65535] host[ipv4:10.10.5x.100:65535] host[ipv4:10.10.5x.100:65535] clcd-fd: bsock.c:220-0 Current host[ipv4:10.10.7x.100:9103] All host[ipv4:10.10.5x.100:9103] host[ipv4:10.10.5x.100:9103] host[ipv4:10.10.7x.100:9103] host[ipv4:10.10.x.x:65535] host[ipv4:10.10.x.100:65535] host[ipv4:10.10.x.100:5535] host[ipv4:10.10.1x.100:65535] host[ipv4:10.10.2x.100:65535] host[ipv4:10.10.2x.100:65535] host[ipv4:10.10.5x.100:65535] host[ipv4:10.10.5x.100:65535] host[ipv4:10.10.5x.100:65535] host[ipv4:10.10.5x.100:65535] clcd-fd: bsock.c:220-0 Current host[ipv4:10.10.x.x:9103] All host[ipv4:10.10.5x.100:9103] host[ipv4:10.10.5x.100:9103] host[ipv4:10.10.7x.100:9103] host[ipv4:10.10.x.x:9103] host[ipv4:10.10.x.100:65535] host[ipv4:10.10.x.100:6555] host[ipv4:10.10.1x.100:65535] host[ipv4:10.10.2x.100:65535] host[ipv4:10.10.2x.100:65535] host[ipv4:10.10.5x.100:65535] host[ipv4:10.10.5x.100:65535] host[ipv4:10.10.5x.100:65535] host[ipv4:10.10.5x.100:65535] I changed all the IP's for security purposes, but hopefully you get the idea. Once it connects to the right IP, it passes authorization- fd: job.c:263-0 <dird: storage address=<FQDN of bacula server> port=9103 ssl=0 fd: job.c:279-0 Executing storage command. fd: job.c:1474-0 StorageCmd: storage address=<FQDN of bacula server> port=9103 ssl=0 fd: bsock.c:220-0 Current host[ipv4:10.10.1x.100:9103] All host[ipv4:10.10.1x.100:9103] fd: bsock.c:154-0 who=Storage daemon host=<FQDN of bacula server> port=9103 fd: cram-md5.c:133-0 cram-get received: auth cram-md5 <1356737337.1350055466@chef-sd> ssl=0 fd: cram-md5.c:152-0 sending resp to challenge: I//Wz5ZIkS9l/9+sE8Q8cC fd: cram-md5.c:80-0 send: auth cram-md5 <57904060.1350055466@clcd-fd> ssl=0 fd: cram-md5.c:99-0 Authenticate OK gkUm/T1qS/cEuT/4+4RasB fd: job.c:263-0 <dird: backup Once it finally gets through that stage of the process, I get a ton of these messages (there's a lot more than shown below)- fd: heartbeat.c:96-0 wait_intr=0 stop=0 fd: heartbeat.c:96-0 wait_intr=0 stop=0 fd: heartbeat.c:96-0 wait_intr=0 stop=0 fd: heartbeat.c:96-0 wait_intr=0 stop=0 fd: heartbeat.c:96-0 wait_intr=0 stop=0 fd: heartbeat.c:96-0 wait_intr=0 stop=0 fd: heartbeat.c:96-0 wait_intr=0 stop=0 fd: heartbeat.c:96-0 wait_intr=0 stop=0 fd: heartbeat.c:96-0 wait_intr=0 stop=0 fd: heartbeat.c:96-0 wait_intr=0 stop=0 fd: heartbeat.c:96-0 wait_intr=0 stop=0 fd: heartbeat.c:96-0 wait_intr=0 stop=0 Then eventually - fd: heartbeat.c:143-0 Send kill to heartbeat id fd: backup.c:211-0 end blast_data ok=1 fd: job.c:282-0 Quit command loop. Canceled=0 fd: job.c:388-0 Calling term_find_files fd: job.c:391-0 Done with term_find_files fd: jcr.c:181-0 write_last_jobs seek to 192 fd: job.c:393-0 Done with free_jcr So, after comparing the two, I'm not sure why the problem host tries to connect to every IP on each VLAN that the bacula server runs on, instead of just the appropriate one like the working hosts. We don't have heartbeat configured on any clients. We use all the default ports for bacula. I wasn't able to get debug output for a failure (shocking), but with what I see so far, something is really odd with the networking on the problem host. We have numerous hosts on this VLAN that have no problem connecting to the bacula server. All other VLANs work fine as well. Anyone have any ideas? TIA +---------------------------------------------------------------------- |This was sent by rgrob...@ajboggs.com via Backup Central. |Forward SPAM to ab...@backupcentral.com. +---------------------------------------------------------------------- ------------------------------------------------------------------------------ Don't let slow site performance ruin your business. Deploy New Relic APM Deploy New Relic app performance management and know exactly what is happening inside your Ruby, Python, PHP, Java, and .NET app Try New Relic at no cost today and get our sweet Data Nerd shirt too! http://p.sf.net/sfu/newrelic-dev2dev _______________________________________________ Bacula-users mailing list Bacula-users@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/bacula-users