-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Getting closer on this, but not quite there yet. More info below

Arno Lehmann wrote:
> Hi,
> 
> On 4/9/2007 10:23 PM, Michael Proto wrote:
> Sorry I'm so late in replying back to this thread, I've had a number of
> long-running full backups running on my Bacula 2.0.3 server that I
> didn't want to touch until completed.
> 
>> I know the feeling...
> 
> Now that they are done I have some
> time to continue investigating this issue.
> 
> Arno Lehmann wrote:
>>>> Hi,
>>>>
>>>> On 4/3/2007 12:04 AM, Michael Proto wrote:
> ...
>>>> I built and packaged all the Bacula Linux clients myself (so they all
>>>> pull from the same set of config files for quick installation), and I
>>>> used the following compile-time flags when building them:
>>>>
>>>> --with-openssl --enable-client-only --enable-static-fd --enable-smartalloc
>>>>
>>>> I'm using the static-bacula-fd binary (instead of the bacula-fd binary)
>>>>
>>>>> Have you checked that the binary is really completely static? Lat time I 
>>>>> tried I could not create a static binary of the FD, the one I could come 
>>>>> up with was created under a really old linux/gcc/glibc combination, and 
>>>>> lacked ACL support.
>>>>> If your binary pulls in shared objects, version differences there might 
>>>>> be related to the problem.
>>>>> Of course, trying the heartbeat interval setting will be the most useful 
>>>>> first step. I'm can't prove that, but I have the impression that Bacula 
>>>>> reacts more sensitive to network problems recently. Might be because 
>>>>> it's more efficient, and so leaves the TCP connections idling longer, or 
>>>>> something...
> Regarding static client binaries, when building the client with the
> "--enable-client-only --enable-static-fd" flags, 2 binaries were
> produced, bacula-fd and static-bacula-fd. Running file and ldd on the
> static binary do seem to indicate that it is compiled statically:
> 
> [EMAIL PROTECTED] ~]# file /sbin/static-bacula-fd
> /sbin/static-bacula-fd: ELF 32-bit LSB executable, Intel 80386, version
> 1 (SYSV), for GNU/Linux 2.2.0, statically linked, stripped
> 
>> This looks good. Better than what I experienced.
> 
> [EMAIL PROTECTED] ~]# ldd /sbin/static-bacula-fd
>         not a dynamic executable
> 
> Even stranger, this error also occurs intermittently on the bacula-fd
> program running on my Bacula server, which was not compiled staticly.
> 
>> Ok, so I think we can rule out a version mismatch between OS/kernel, 
>> system libraries, and user libraries.
> 
> On the network/heartbeat issue:
> The strange thing is, when I initiate a backup job against an affected
> client, it fails immediately, before sending much of anything to the SD
> with the "Fatal error: Socket error on Storage command: ERR=No data
> available" error. Could that really be related to heartbeat?
> 
>> It does not sound like that.
> 
> In any case, I've added the following to the bacula-sd.conf:
> 
>   Heartbeat Interval = 60
> 
> And I'll try adding the same interval to a few affected clients today
> and see if that helps.
> 
>> Watch out when you are prompted for tape changes - I experienced some 
>> problems (not yet reported as a bug, but soon - hopefully).
> 
> I'll also see if I can get a valid tcpdump of the client and server
> communications to see exactly what sorts of packets are being sent when
> this failure occurs.
> 
>> Yep, that's I'd try, too, although this is tricky... you run tcpdump for 
>> days, produce tons of output, and no error occurs... and then, when the 
>> error occors, it's on another machine, your tcpdump crashed because the 
>> file system for the log ws full, or some i**** killed it because he 
>> thought it was a forgotten process :-)
> 
>> If you get a usable dump, you've still got to read and understand it, 
>> which can be new challenge...


Well, I added an additional stderr redirect on one client, so the
bacula-fd application runs with the following arguments (I run it in
foreground mode out of inittab, hence the -f flag):

/sbin/static-bacula-fd -fvc -d100 /etc/bacula/bacula-fd.conf
>/tmp/bacula-fd.out 2>/tmp/bacula-fd.err

Good news is, I was able to immediately reproduce the error with this
client when I ran an ad-hoc incremental backup this morning. The bad
news is, neither the /tmp/bacula-fd.out nor the /tmp/bacula-fd.err file
have any data in them:

svn:mproto/ $ ls -l /tmp/bacula*
- -rw-r--r-- 1 root root 0 2007-04-13 06:35 /tmp/bacula-fd.err
- -rw-r--r-- 1 root root 0 2007-04-13 06:35 /tmp/bacula-fd.out

Timestamps for both files were updated when this job ran, but nothing
seems to appear in the files. Am I using the -d flag correctly? I don't
get why its not outputting anything to either the stdout or the stderr file.


On a hunch, I went ahead and started a packet capture ala tcpdump on the
client just before starting this backup, and I do now have a pcap file
that contains the client/server communications when the error occurred.
Since I have "Reschedule Interval = 90 seconds" in my job definition,
job restarted and that time it worked as expected, which is also in my
pcap file (thankfully, separated by 90 seconds so I can see what a
successful transfer looks like versus an unsuccessful one).

I'm still combing over this pcap file (not much data in it as the
incremental backup only backed-up 2 files worth 95kb when it was
successful), but I do see that when it failed, the -fd on the client
never actually made a connection back to port 9103 (for the -sd
connection) on the server. 90 seconds later, during the 2nd run, I do
see a successful connection from the client to port 9103 on the server.
This leads me to believe that the interruption is happening before the
fd can connect back to the sd. Where to go from here I do not yet know.

Here's a sample of the communication during the failed connection
(192.168.14.122 being the client and 192.168.253.199 being the server):


06:35:41.982020 IP 192.168.253.199.45164 > 192.168.14.122.9102: S
2958123119:2958123119(0) win 5840 <mss 1460,sackOK,timestamp 1042418939
0,nop,wscale 0>
06:35:41.982050 IP 192.168.14.122.9102 > 192.168.253.199.45164: S
467981338:467981338(0) ack 2958123120 win 5792 <mss
1460,sackOK,timestamp 156766135 1042418939,nop,wscale 2>
06:35:41.986029 IP 192.168.253.199.45164 > 192.168.14.122.9102: . ack 1
win 5840 <nop,nop,timestamp 1042418945 156766135>
06:35:41.986078 IP 192.168.253.199.45164 > 192.168.14.122.9102: P 1:5(4)
ack 1 win 5840 <nop,nop,timestamp 1042418945 156766135>
06:35:41.986112 IP 192.168.14.122.9102 > 192.168.253.199.45164: . ack 5
win 1448 <nop,nop,timestamp 156766139 1042418945>
06:35:41.989333 IP 192.168.253.199.45164 > 192.168.14.122.9102: P
5:41(36) ack 1 win 5840 <nop,nop,timestamp 1042418950 156766139>
06:35:41.989380 IP 192.168.14.122.9102 > 192.168.253.199.45164: . ack 41
win 1448 <nop,nop,timestamp 156766142 1042418950>
06:35:41.989528 IP 192.168.14.122.9102 > 192.168.253.199.45164: P 1:5(4)
ack 41 win 1448 <nop,nop,timestamp 156766142 1042418950>
06:35:41.991046 IP 192.168.253.199.45164 > 192.168.14.122.9102: . ack 5
win 5840 <nop,nop,timestamp 1042418952 156766142>
06:35:41.991054 IP 192.168.14.122.9102 > 192.168.253.199.45164: P
5:56(51) ack 41 win 1448 <nop,nop,timestamp 156766144 1042418952>
06:35:41.992207 IP 192.168.253.199.45164 > 192.168.14.122.9102: . ack 56
win 5840 <nop,nop,timestamp 1042418953 156766144>
06:35:41.992427 IP 192.168.253.199.45164 > 192.168.14.122.9102: P
41:45(4) ack 56 win 5840 <nop,nop,timestamp 1042418953 156766144>

...

06:35:43.000346 IP 192.168.14.122.9102 > 192.168.253.199.45164: . ack
743 win 1448 <nop,nop,timestamp 156767153 1042419885>
06:35:43.001682 IP 192.168.253.199.45164 > 192.168.14.122.9102: P
743:766(23) ack 238 win 5840 <nop,nop,timestamp 1042419929 156767153>
06:35:43.001732 IP 192.168.14.122.9102 > 192.168.253.199.45164: . ack
766 win 1448 <nop,nop,timestamp 156767155 1042419929>
06:35:43.001843 IP 192.168.14.122.9102 > 192.168.253.199.45164: P
238:242(4) ack 766 win 1448 <nop,nop,timestamp 156767155 1042419929>
06:35:43.036407 IP 192.168.253.199.45164 > 192.168.14.122.9102: . ack
242 win 5840 <nop,nop,timestamp 1042419970 156767155>
06:35:43.036505 IP 192.168.14.122.9102 > 192.168.253.199.45164: P
242:401(159) ack 766 win 1448 <nop,nop,timestamp 156767190 1042419970>
06:35:43.037608 IP 192.168.253.199.45164 > 192.168.14.122.9102: . ack
401 win 6432 <nop,nop,timestamp 1042419972 156767190>
06:35:43.038345 IP 192.168.253.199.45164 > 192.168.14.122.9102: P
766:770(4) ack 401 win 6432 <nop,nop,timestamp 1042419972 156767190>
06:35:43.078234 IP 192.168.14.122.9102 > 192.168.253.199.45164: . ack
770 win 1448 <nop,nop,timestamp 156767231 1042419972>
06:35:43.079358 IP 192.168.253.199.45164 > 192.168.14.122.9102: P
770:827(57) ack 401 win 6432 <nop,nop,timestamp 1042420020 156767231>
06:35:43.079424 IP 192.168.14.122.9102 > 192.168.253.199.45164: . ack
827 win 1448 <nop,nop,timestamp 156767232 1042420020>
06:35:43.083707 IP 192.168.14.122.9102 > 192.168.253.199.45164: F
401:401(0) ack 827 win 1448 <nop,nop,timestamp 156767237 1042420020>
06:35:43.120329 IP 192.168.253.199.45164 > 192.168.14.122.9102: . ack
402 win 6432 <nop,nop,timestamp 1042420066 156767237>
06:35:47.215667 IP 192.168.253.199.45164 > 192.168.14.122.9102: F
827:827(0) ack 402 win 6432 <nop,nop,timestamp 1042424164 156767237>
06:35:47.215708 IP 192.168.14.122.9102 > 192.168.253.199.45164: . ack
828 win 1448 <nop,nop,timestamp 156771369 1042424164>

And then it dies, with the error:
Fatal error: Socket error on Storage command: ERR=No data available

90 seconds later, it restarts...

06:37:20.867481 IP 192.168.253.199.45167 > 192.168.14.122.9102: S
3078819506:3078819506(0) win 5840 <mss 1460,sackOK,timestamp 1042517754
0,nop,wscale 0>
06:37:20.867541 IP 192.168.14.122.9102 > 192.168.253.199.45167: S
577555412:577555412(0) ack 3078819507 win 5792 <mss
1460,sackOK,timestamp 156865035 1042517754,nop,wscale 2>
06:37:20.870573 IP 192.168.253.199.45167 > 192.168.14.122.9102: . ack 1
win 5840 <nop,nop,timestamp 1042517757 156865035>
06:37:20.870619 IP 192.168.253.199.45167 > 192.168.14.122.9102: P 1:5(4)
ack 1 win 5840 <nop,nop,timestamp 1042517757 156865035>
06:37:20.870681 IP 192.168.14.122.9102 > 192.168.253.199.45167: . ack 5
win 1448 <nop,nop,timestamp 156865039 1042517757>
06:37:20.871644 IP 192.168.253.199.45167 > 192.168.14.122.9102: P
5:41(36) ack 1 win 5840 <nop,nop,timestamp 1042517760 156865039>
06:37:20.871690 IP 192.168.14.122.9102 > 192.168.253.199.45167: . ack 41
win 1448 <nop,nop,timestamp 156865040 1042517760>
06:37:20.871854 IP 192.168.14.122.9102 > 192.168.253.199.45167: P 1:5(4)
ack 41 win 1448 <nop,nop,timestamp 156865040 1042517760>
06:37:20.873314 IP 192.168.253.199.45167 > 192.168.14.122.9102: . ack 5
win 5840 <nop,nop,timestamp 1042517761 156865040>
06:37:20.873326 IP 192.168.14.122.9102 > 192.168.253.199.45167: P
5:55(50) ack 41 win 1448 <nop,nop,timestamp 156865041 1042517761>
06:37:20.874664 IP 192.168.253.199.45167 > 192.168.14.122.9102: . ack 55
win 5840 <nop,nop,timestamp 1042517764 156865041>
06:37:20.875295 IP 192.168.253.199.45167 > 192.168.14.122.9102: P
41:45(4) ack 55 win 5840 <nop,nop,timestamp 1042517764 156865041>

...

06:37:21.816698 IP 192.168.253.199.45167 > 192.168.14.122.9102: P
743:766(23) ack 237 win 5840 <nop,nop,timestamp 1042518725 156865983>
06:37:21.816750 IP 192.168.14.122.9102 > 192.168.253.199.45167: . ack
766 win 1448 <nop,nop,timestamp 156865985 1042518725>
06:37:21.817566 IP 192.168.14.122.9102 > 192.168.253.199.45167: P
237:241(4) ack 766 win 1448 <nop,nop,timestamp 156865986 1042518725>
06:37:21.884942 IP 192.168.253.199.45167 > 192.168.14.122.9102: . ack
241 win 5840 <nop,nop,timestamp 1042518768 156865986>
06:37:21.885041 IP 192.168.14.122.9102 > 192.168.253.199.45167: P
241:400(159) ack 766 win 1448 <nop,nop,timestamp 156866053 1042518768>
06:37:21.886589 IP 192.168.253.199.45167 > 192.168.14.122.9102: . ack
400 win 6432 <nop,nop,timestamp 1042518770 156866053>
06:37:21.887233 IP 192.168.253.199.45167 > 192.168.14.122.9102: P
766:770(4) ack 400 win 6432 <nop,nop,timestamp 1042518771 156866053>
06:37:21.927367 IP 192.168.14.122.9102 > 192.168.253.199.45167: . ack
770 win 1448 <nop,nop,timestamp 156866095 1042518771>
06:37:21.929277 IP 192.168.253.199.45167 > 192.168.14.122.9102: P
770:827(57) ack 400 win 6432 <nop,nop,timestamp 1042518822 156866095>
06:37:21.929345 IP 192.168.14.122.9102 > 192.168.253.199.45167: . ack
827 win 1448 <nop,nop,timestamp 156866097 1042518822>
06:37:21.931531 IP 192.168.14.122.57572 > 192.168.253.199.9103: S
573973565:573973565(0) win 5840 <mss 1460,sackOK,timestamp 156866100
0,nop,wscale 2>
06:37:21.933500 IP 192.168.253.199.9103 > 192.168.14.122.57572: S
3067793026:3067793026(0) ack 573973566 win 5792 <mss
1460,sackOK,timestamp 1042518826 156866100,nop,wscale 0>
06:37:21.933527 IP 192.168.14.122.57572 > 192.168.253.199.9103: . ack 1
win 1460 <nop,nop,timestamp 156866102 1042518826>
06:37:21.933619 IP 192.168.14.122.57572 > 192.168.253.199.9103: P 1:5(4)
ack 1 win 1460 <nop,nop,timestamp 156866102 1042518826>
06:37:21.934695 IP 192.168.253.199.9103 > 192.168.14.122.57572: . ack 5
win 5792 <nop,nop,timestamp 1042518828 156866102>
06:37:21.934747 IP 192.168.14.122.57572 > 192.168.253.199.9103: P
5:45(40) ack 1 win 1460 <nop,nop,timestamp 156866103 1042518828>
06:37:21.936898 IP 192.168.253.199.9103 > 192.168.14.122.57572: . ack 45
win 5792 <nop,nop,timestamp 1042518830 156866103>
06:37:21.979358 IP 192.168.253.199.9103 > 192.168.14.122.57572: P 1:5(4)
ack 45 win 5792 <nop,nop,timestamp 1042518882 156866103>
06:37:21.979479 IP 192.168.14.122.57572 > 192.168.253.199.9103: . ack 5
win 1460 <nop,nop,timestamp 156866148 1042518882>


And as you can see here, the fd now makes a connection to port 9103 on
the server, and everything works just fine.


I don't get it. I'm going to look over this pcap file some more and see
if I can find *anything* there, but since the fd isn't making an
outbound connection to the sd when it fails (at least not one that gets
as far as the network interface card), I'm guessing this won't yield
many hints.

Any more ideas on getting the -fd to actually spit-out debug information
is very welcome, and thanks again for all the help!


- -Proto

- --
Michael Proto            | SecureWorks
Unix Administrator       |
PGP ID: 5D575BBE         | [EMAIL PROTECTED]
*******************************************************
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.7 (FreeBSD)

iD8DBQFGHQV5OLq/wl1XW74RAjqNAJ0Rf94shmVGYbcWKtC9JyPm9mBUuwCcCXJ4
IfvGjqLoYfL6faDmaxwDE04=
=Zt5z
-----END PGP SIGNATURE-----

-------------------------------------------------------------------------
Take Surveys. Earn Cash. Influence the Future of IT
Join SourceForge.net's Techsay panel and you'll get the chance to share your
opinions on IT & business topics through brief surveys-and earn cash
http://www.techsay.com/default.php?page=join.php&p=sourceforge&CID=DEVDEV
_______________________________________________
Bacula-users mailing list
Bacula-users@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/bacula-users

Reply via email to