Hello Volker,

On Sunday 17 July 2005 21:29, Volker Sauer wrote:
> Hi Kern,
>
> the bacula-server runs independently of NFS mounts, sorry to say that.
> The bsr files are copied by a cron-job to NFS and all the bacula files
> and mysql are on local disks. The machine whose jobs caused the hangs
> is actually independent from NFS, too - at least I do not write or read
> anything from or to NFS on all computers involved in the backup process.
> Well, there are - or there could be - mounts on the server and the clients
> (the machines are running autofs, so you never know if there are mounts
> or not), but bacula is independent of these in the sense that I don't
> read and don't write to NFS. Or do you mean the locks could be caused by
> NFS mounts on the machine which are not directly related to bacula??

No, in principle, Bacula would have to attempt to access the NFS mount.

> Mmh, this could be.... but actually there's no sign of stale NFS
> handles in the logs.....

In one of your earlier emails there were NFS version warning messages. These
could be a source of problems, but that is unlikely.

>
> Last night all jobs ran fine. I excluded bali-root which caused the lock
> twice. Could it be a hard disk failure on the client?

Yes. Bacula would wait a long time for the client.  It all depends on what 
exactly is going wrong.

> Or bad memory in the client? Sometimes I have the feeling that this
> computer "bali" is a little bit weird, since sometimes cron-jobs segfault
> without an apparent reason. Maybe bacula is affected, too... (could be
> bad memory).

If there is a memory problem, normally the client will fail.  In that case, 
the director will not lock.  If the Client goes into a CPU loop or hangs 
because of a disk error, then it is possible that the director will be locked 
up.

At a minimum, you should reboot the client and run memtest to ensure that the 
memory is good (or at least passes the test).

>
> If I wanted to use the debugger with debug symbols, I'd have to recompile
> bacula since the debian packages provide only the stripped binaries. This
> will take a little time, since we're running in production.... So I
> really hope, that the error comes from the client....
> I'll keep track of things and I'll try to get a debug-version into
> production in case the director locks up again. I'll keep the list
> up-to-date...

You could try running it under the debugger without debugging symbols. This is
not ideal, but if it is an internal deadlock, I should be able to see it.
That might save you the time of rebuilding it.  Unfortunately, without
manually running it under the debugger and getting some form of traceback,
there isn't much I can do to resolve it.
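
As a rough sketch of how such a traceback could be captured, the Python
helper below attaches gdb to an already running director and saves the
output of "thread apply all bt". This is an alternative to starting the
director under gdb and pressing Ctrl-C, not the procedure from the Kaboom
chapter. It assumes gdb is installed and recent enough to accept the
-p, -batch and -ex options. The function names and the output file name
are hypothetical.

```python
import subprocess
from typing import List


def build_gdb_argv(pid: int) -> List[str]:
    """Build a gdb command line that attaches to the process `pid`,
    prints a backtrace of every thread, then exits (batch mode)."""
    return ["gdb", "-p", str(pid), "-batch", "-ex", "thread apply all bt"]


def dump_backtraces(pid: int, out_path: str) -> int:
    """Run gdb against the process and save its stdout and stderr to
    out_path. Returns gdb's exit status."""
    with open(out_path, "w") as out:
        result = subprocess.run(
            build_gdb_argv(pid), stdout=out, stderr=subprocess.STDOUT
        )
    return result.returncode
```

For example, dump_backtraces(pid_of_director, "bacula-dir.traceback")
would be called once the director has locked up, normally as root or as
the user that owns the process. Without debug symbols the frames will show
few function names, but threads blocked on a lock should still be visible.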

>
> Thanks for your help (so far)!
> Regards
> Volker
>
> On So, 17 Jul 2005, Kern Sibbald wrote:
> > Hello Volker,
> >
> > About the only thing I can think of is that you have a stale or bad NFS
> > connection and you are trying to write the bootstrap file to another
> > machine with the bad NFS link -- or perhaps the other machine is just
> > down.  In that case, Bacula will hang forever.  Don't blame me -- I don't
> > know why NFS file operations block forever when no one is on the other end.
> >
> > If that is not the case, about the only solution is for you to run the
> > director manually under the debugger. When it locks up, press Ctrl-C, then
> > proceed with getting a traceback using the instructions in the Kaboom
> > chapter (I mainly need the output from "thread apply all bt").  Make sure
> > you have debug symbols turned on (i.e. compiled with -g and not stripped;
> > note that FreeBSD has a habit of stripping everything it installs).
> >
> > On Saturday 16 July 2005 12:02, Volker Sauer wrote:
> > > On Fr, 15 Jul 2005, Volker Sauer wrote:
> > > > On Fr, 15 Jul 2005, Arno Lehmann wrote:
> > > > > >I'll upgrade to 1.36.3 and see what happens. Maybe "Fix deadlock
> > > > > > in multiple simultaneous jobs." (from ReleaseNotes) could be the
> > > > > > right one. I already setup this site with 1.36.3 FileFormat
> > > > > > because I knew it's going to be required!
> > > > >
> > > > > I had the same problem of a locking DIR, which worked ok after a
> > > > > restart, and I could never find a reason (partly because I never
> > > > > investigated with gdb, but that's beyond my skills and as long as I
> > > > > could restart my backups rather easily that was ok).
> > > > > With 1.36.3 this problem vanished.
> > > > > Until yesterday.
> > > >
> > > > Yes, the same with me. I upgraded to 1.36.3 and the problem occurred
> > > > again, yesterday.
> > > > Now I setup "trace on" and "setdebug 100" for dir and sd and I'm
> > > > waiting for the problem to occur again!
> > >
> > > Last night, the director locked up again. (See traces attached).
> > > The job "paris-home.archived" was finished. The jobs "paris-home.guest"
> > > and "paris-home.staff.1" are stuck in the holding-disk, because the
> > > director locked up as the job "bali-rootfs" started - nothing was
> > > spooled from bali-rootfs, the director seemed to be stuck immediately.
> > > Btw: The director Maximum Concurrent Jobs = 6 and, the client is
> > > usually set to Maximum Concurrent Jobs = 1 except the host paris, where
> > > it is 2. The storage daemon is set to Maximum Concurrent Jobs = 20.
> > >
> > > An interesting thing is: again it's the job bali-rootfs that causes the
> > > director to lock up. I'll exclude this job for a few days and see if
> > > the director still locks up. Plus, I'll set the debuglevel to 200.
> > >
> > > I've attached backup-dir.conmsg and bacula.trace (level 100). I don't see
> > > anything unusual in bacula.trace.
> > >
> > > Btw: the first part of bacula.trace are the jobs of the night before
> > > last night. They finished without problems. The trace of last night
> > > seems to start around line 518. At the end of the file in line 722 I
> > > tried to connect with bconsole. The connect timed out with no entry in
> > > the logfile.
> > >
> > > I cleared the kernel ringbuffer yesterday so in case any hardware or
> > > bus-problems occur, there should be errors. There's only:
> > >
> > > ---------------
> > > nfs warning: mount version older than kernel
> > > nfs warning: mount version older than kernel
> > > APIC error on CPU1: 02(02)
> > > APIC error on CPU0: 02(02)
> > > nfs warning: mount version older than kernel
> > > nfs warning: mount version older than kernel
> > > nfs warning: mount version older than kernel
> > > nfs warning: mount version older than kernel
> > > nfs warning: mount version older than kernel
> > > nfs warning: mount version older than kernel
> > > nfs warning: mount version older than kernel
> > > nfs warning: mount version older than kernel
> > > nfs warning: mount version older than kernel
> > > nfs warning: mount version older than kernel
> > > nfs warning: mount version older than kernel
> > > nfs warning: mount version older than kernel
> > > ----------------
> > >
> > > That's all.
> > >
> > > I hope you can see something in the logs, that I missed!
> >
> > --
> > Best regards,
> >
> > Kern
> >
> >   (">
> >   /\
> >   V_V

-- 
Best regards,

Kern

  (">
  /\
  V_V


_______________________________________________
Bacula-users mailing list
Bacula-users@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/bacula-users
