Re: [Bacula-users] Incomplete backup - due to bsock error

Jerry Lowry Mon, 25 Sep 2017 11:25:44 -0700

Hi again!
I hate to return to this, but I just got the same errors on my other backup
server, minutes ago, running the same type of copy job.  This system is
running the same configuration:
CentOS 6.9
Linux 2.6.32-696.10.1.el6.x86_64
MariaDB 10.2.8
Bacula 9.0.3


Nothing has changed in the Bacula config files since before the upgrade to
the latest version.

Job {
        Name = "CopyWKDiskToDisk"
        Type = Copy
        Level = Full
        FileSet = "Bottom Set"
        Client = distress-fd
        Messages = Standard
        Storage = workstations
        Pool = WorkstationPool
        Maximum Concurrent Jobs = 4
        Selection Type = PoolUncopiedJobs
        Selection Pattern = "DC-*"
}


# File Pool definition
Pool {
  Name = OffsiteBottom
  Pool Type = Copy
  Next Pool = OffsiteBottom
  Storage = bottomswap
  Recycle = yes                       # Bacula can automatically recycle Volumes
  AutoPrune = yes                     # Prune expired volumes
  Volume Retention = 30 years         # thirty years
  Maximum Volume Bytes = 1800G          # Limit Volume to disk size
  Maximum Volumes = 10               # Limit number of Volumes in Pool
}


# Definition of file storage device
Storage {
  Name = bottomswap            # offsite disk
# Do not use "localhost" here
  #Address = distress.ACCOUNTING.EDT.LOCAL                # N.B. Use a fully qualified name here
  Address = 10.10.10.3              # N.B. Use a fully qualified name here
  SDPort = 9103
  Password = ""
  Device = BottomSwap
  Media Type = File
}

Device {
  Name = BottomSwap
  Media Type = File
  Archive Device = /BottomSwap
  LabelMedia = yes;                   # lets Bacula label unlabeled media
  Random Access = Yes;
  AutomaticMount = yes;               # when device opened, read it
  RemovableMedia = no;
  AlwaysOpen = no;
}
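
As a sanity check on the Pool's Maximum Volume Bytes setting, the short
Python sketch below converts 1800G to bytes, assuming Bacula's "G" suffix
means 2^30 bytes (an assumption worth confirming against the manual for your
version).  The two byte counts are the "End of medium" values quoted further
down this thread for the OffsiteMid pool on kilchis; that pool's
configuration is not shown, so treating it as having the same 1800G limit is
also an assumption.  The function name is illustrative only.

```python
GIB = 2 ** 30  # assumed meaning of Bacula's "G" size suffix


def max_volume_bytes(gigabytes: int) -> int:
    """Convert a 'Maximum Volume Bytes = <n>G' value to bytes (assuming G = 2**30)."""
    return gigabytes * GIB


limit = max_volume_bytes(1800)

# "End of medium" byte counts copied from the JobId 35203 log quoted in this thread.
observed = {
    "homeMS-200": 1_932_735_274_146,
    "homeMS-201": 1_932_735_281_790,
}

for volume, size in observed.items():
    print(f"{volume}: {limit - size} bytes below the assumed limit of {limit}")
```

Both volumes stop a few kilobytes short of the computed limit, which is
consistent with them filling up to Maximum Volume Bytes rather than being cut
short, though it does not rule out a disk problem.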

I checked the message log and there are no network errors.

dmesg shows the disk change that I just finished, but there are no errors!
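
One quick way to double-check a log for the errors in question is to scan it
for bsock.c messages and connection resets.  The sketch below is a minimal
Python example that assumes log lines look like the Bacula job messages
quoted later in this thread; the function name and sample lines are
illustrative, and the patterns are not an exhaustive list of network errors.

```python
import re

# Patterns taken from messages quoted in this thread; not exhaustive.
NETWORK_ERROR = re.compile(r"bsock\.c:\d+|Connection reset by peer|Network error")


def find_network_errors(lines):
    """Return (line_number, text) pairs for lines that look like network errors."""
    return [
        (number, line.rstrip())
        for number, line in enumerate(lines, start=1)
        if NETWORK_ERROR.search(line)
    ]


sample = [
    '13-Sep 08:23 kilchis JobId 35203: New volume "homeMS-202" mounted on device "MidSwap" (/MidSwap) at 13-Sep-2017 08:23.',
    "13-Sep 08:43 kilchis JobId 35203: Error: bsock.c:849 Read error from Storage daemon:kilchis:9103: ERR=Connection reset by peer",
    "13-Sep 08:43 kilchis JobId 35203: Fatal error: append.c:271 Network error reading from FD. ERR=Connection reset by peer",
]

for number, text in find_network_errors(sample):
    print(number, text)
```

To scan a real file, pass an open file object instead of the sample list,
using whatever path your Bacula or system log actually lives at.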

I'm at a loss, as I don't want to keep restarting these backups due to time
constraints with the other backup jobs.

Were there any changes in this part of the code for v 9.0.3?

jerry


On Fri, Sep 22, 2017 at 11:19 AM, Jerry Lowry <michaiah2...@gmail.com>
wrote:

> Yes, kilchis is a bona fide hardware server. The only VMs I have are test
> systems running on my desktop.
>
> There are 2 copy jobs on this system. This particular job is the one that
> typically runs long enough that it will need a new volume during the
> night.  The other one will if it is run late in the day and the current
> volume does not have very much space left on it. The other daily backup
> jobs will wait until the copy job is finished, but there is nothing else
> running on the system that utilizes the network except for VNC traffic.
> This problem happened two weeks in a row and this last week it worked just
> fine.  The one thing that is different is that I dropped all of the current
> backup files and purged them from the DB. I then recreated new files to
> backup to.  Just wondering if one of the files was writing on a
> questionable sector on disk.  Nothing in the logs and smart does not give
> any details on that.
>
> I think I will call it a fluke and keep a watch on it in the future.
> Thanks!
>
> On Fri, Sep 22, 2017 at 10:27 AM, Martin Simmons <mar...@lispworks.com>
> wrote:
>
>> That's odd -- the reading side looks normal to me until the error is
>> detected.
>>
>> Also, "Connection reset by peer" doesn't normally occur when connected to
>> the current machine.
>>
>> Is kilchis a real computer (not a VM)?
>>
>> Is this the only copy job that waits overnight for someone to label a new
>> volume?
>>
>> Maybe something happens overnight on the system that causes networking to
>> be disrupted in some subtle way, causing "Connection reset by peer" when the
>> connection is closed cleanly?
>>
>> __Martin
>>
>>
>> >>>>> On Tue, 19 Sep 2017 15:31:46 -0700, Jerry Lowry said:
>> >
>> > The reading side is the same system.  It is a copy job setup to backup
>> > daily backups to the offsite backup disk.
>> > The attachment is the bacula jobid 35202.
>> >
>> > jerry
>> >
>> > On Tue, Sep 19, 2017 at 10:08 AM, Martin Simmons <mar...@lispworks.com>
>> > wrote:
>> >
>> > > The email below is from the writing side of the copy job and the message:
>> > >
>> > > 13-Sep 08:43 kilchis JobId 35203: Error: bsock.c:849 Read error from Storage daemon:kilchis:9103: ERR=Connection reset by peer
>> > >
>> > > shows that the connection to the reading side of the job was closed
>> > > unexpectedly from the reading end.
>> > >
>> > > Do you have the corresponding email from the reading side?  It will have a
>> > > different JobId (but should mention JobId 35203) and should start with
>> > > something like "Using Device ... to read."
>> > >
>> > > __Martin
>> > >
>> > >
>> > > >>>>> On Mon, 18 Sep 2017 13:42:19 -0700, Jerry Lowry said:
>> > > >
>> > > > Martin,
>> > > > Here is the complete email that was sent just before the "Copy Error"
>> > > > message:
>> > > >
>> > > > 12-Sep 15:09 kilchis-dir JobId 35203: Using Device "MidSwap" to write.
>> > > > 12-Sep 15:09 kilchis JobId 35203: Volume "homeMS-200" previously written, moving to end of data.
>> > > > 12-Sep 15:27 kilchis JobId 35203: End of medium on Volume "homeMS-200" Bytes=1,932,735,274,146 Blocks=29,959,317 at 12-Sep-2017 15:27.
>> > > > 12-Sep 15:28 kilchis JobId 35203: Job BackupUsers.2017-09-12_09.05.09_50 is waiting. Cannot find any appendable volumes.
>> > > > Please use the "label" command to create a new Volume for:
>> > > >     Storage:      "MidSwap" (/MidSwap)
>> > > >     Pool:         OffsiteMid
>> > > >     Media type:   File
>> > > > 12-Sep 15:36 kilchis JobId 35203: Wrote label to prelabeled Volume "homeMS-201" on File device "MidSwap" (/MidSwap)
>> > > > 12-Sep 15:36 kilchis JobId 35203: New volume "homeMS-201" mounted on device "MidSwap" (/MidSwap) at 12-Sep-2017 15:36.
>> > > > 12-Sep 19:54 kilchis JobId 35203: End of medium on Volume "homeMS-201" Bytes=1,932,735,281,790 Blocks=29,959,315 at 12-Sep-2017 19:54.
>> > > > 12-Sep 19:54 kilchis JobId 35203: Job BackupUsers.2017-09-12_09.05.09_50 is waiting. Cannot find any appendable volumes.
>> > > > Please use the "label" command to create a new Volume for:
>> > > >     Storage:      "MidSwap" (/MidSwap)
>> > > >     Pool:         OffsiteMid
>> > > >     Media type:   File
>> > > > 12-Sep 20:57 kilchis JobId 35203: Job BackupUsers.2017-09-12_09.05.09_50 is waiting. Cannot find any appendable volumes.
>> > > > Please use the "label" command to create a new Volume for:
>> > > >     Storage:      "MidSwap" (/MidSwap)
>> > > >     Pool:         OffsiteMid
>> > > >     Media type:   File
>> > > > 12-Sep 23:03 kilchis JobId 35203: Job BackupUsers.2017-09-12_09.05.09_50 is waiting. Cannot find any appendable volumes.
>> > > > Please use the "label" command to create a new Volume for:
>> > > >     Storage:      "MidSwap" (/MidSwap)
>> > > >     Pool:         OffsiteMid
>> > > >     Media type:   File
>> > > > 13-Sep 03:15 kilchis JobId 35203: Job BackupUsers.2017-09-12_09.05.09_50 is waiting. Cannot find any appendable volumes.
>> > > > Please use the "label" command to create a new Volume for:
>> > > >     Storage:      "MidSwap" (/MidSwap)
>> > > >     Pool:         OffsiteMid
>> > > >     Media type:   File
>> > > > 13-Sep 08:23 kilchis JobId 35203: Wrote label to prelabeled Volume "homeMS-202" on File device "MidSwap" (/MidSwap)
>> > > > 13-Sep 08:23 kilchis JobId 35203: New volume "homeMS-202" mounted on device "MidSwap" (/MidSwap) at 13-Sep-2017 08:23.
>> > > > 13-Sep 08:43 kilchis JobId 35203: Error: bsock.c:849 Read error from Storage daemon:kilchis:9103: ERR=Connection reset by peer
>> > > > 13-Sep 08:43 kilchis JobId 35203: Fatal error: append.c:271 Network error reading from FD. ERR=Connection reset by peer
>> > > > 13-Sep 08:43 kilchis JobId 35203: Elapsed time=04:56:15, Transfer rate=125.6 M Bytes/second
>> > > > 13-Sep 08:43 kilchis JobId 35203: Sending spooled attrs to the Director. Despooling 1,533,148,574 bytes ...
>> > > >
>> > > > I don't have the job log. Interestingly, I did not have any problems with
>> > > > this or any other copy job before I upgraded.  I went from 5.2.13 to 9.0.3
>> > > > of Bacula and latest version of MySql to Mariadb.  Not saying that this is
>> > > > a problem, because I have 5 other copy jobs that work without error still.
>> > > > This one just happens to be the biggest one.
>> > > >
>> > > > thanks,
>> > > > jerry
>> > > >
>> > > > On Mon, Sep 18, 2017 at 7:55 AM, Martin Simmons <mar...@lispworks.com>
>> > > > wrote:
>> > > >
>> > > > > A copy job will communicate using TCP between the Bacula daemons.  A bsock
>> > > > > error could indicate that bacula-sd closed the connection unexpectedly and I
>> > > > > would expect media errors to be logged.
>> > > > >
>> > > > > Your syslog did include some I/O errors.  Are they caused by something
>> > > > > else?
>> > > > >
>> > > > > Do you have the complete job log (from the Bacula log, not the syslog)?
>> > > > >
>> > > > > __Martin
>> > > > >
>> > > > >
>> > > > > >>>>> On Wed, 13 Sep 2017 09:35:07 -0700, Jerry Lowry said:
>> > > > > >
>> > > > > > Kern,
>> > > > > > My Offsite Backup just failed again on the same drive, different disk. It
>> > > > > > failed with the same bsock error.  If the backup is working on the same
>> > > > > > system using the copy function, how far out of the network stack does it
>> > > > > > go.  My thinking is it does not get out of the application layer.  Is this
>> > > > > > right?  Why would I get a bsock error?
>> > > > > >
>> > > > > > I have taken a look at the smart data for the disk and they seem to be
>> > > > > > running okay. I am getting some sector relocation errors, would that cause
>> > > > > > the bsock error during a remap?  This procedure has been running flawlessly
>> > > > > > for many years ( except for human error ).  I am wondering if I should
>> > > > > > delete the present disk files and let bacula recreate new ones.
>> > > > > >
>> > > > > > thanks for your help!
>> > > > > >
>> > > > > > jerry
>> > > > > >
>> > > > > >
>> > > > > > On Wed, Sep 6, 2017 at 11:26 PM, Kern Sibbald <k...@sibbald.com> wrote:
>> > > > > >
>> > > > > > > Hello,
>> > > > > > >
>> > > > > > > If the job is marked as Incomplete in the catalog ("I" I think), then you
>> > > > > > > can simply restart it and it should pickup where it left off.  If not you
>> > > > > > > must run it again from the beginning.
>> > > > > > >
>> > > > > > > If you are switching devices when one is full during a Job, it is unlikely
>> > > > > > > you can restore that job when it terminates. I recommend carefully testing
>> > > > > > > restores on your system.
>> > > > > > >
>> > > > > > > Best regards,
>> > > > > > >
>> > > > > > > Kern
>> > > > > > >
>> > > > > > > On 09/06/2017 05:38 PM, Jerry Lowry wrote:
>> > > > > > >
>> > > > > > > List,
>> > > > > > > I am running Bacula 9.0.3, MariaDB 10.2.8 on CentOS 6.9.  I got notice
>> > > > > > > last night that my Offsite backup failed due to a bsock error.  My offsite
>> > > > > > > drives are attached to an ATTO raid card which gives me hot swap
>> > > > > > > capability. This configuration works great as it allows me to hot swap a
>> > > > > > > drive when it fills up with a new drive to continue with.  The problem is
>> > > > > > > included below. The backup that I was doing is to the OffsiteMid drive
>> > > > > > > which is mounted as /dev/sde. Is there a way to restart this backup job or
>> > > > > > > am I left with an incomplete backup going forward?
>> > > > > > >
>> > > > > > > thanks for your help,
>> > > > > > >
>> > > > > > > jerry
>> > > > > > >
>> > > > > > >
>> > > > > > > Sep  5 08:46:01 kilchis bat[4339]: bsock.c:147 Unable to connect to Director daemon on kilchis:9101. ERR=Connection refused
>> > > > > > > Sep  5 10:37:20 kilchis attocfgd: [CRIT] [ExpressSAS R608,50:01:08:60:00:57:3d:c0] [FW] RAID Group state now Offline: OffsiteTop
>> > > > > > > Sep  5 10:39:06 kilchis kernel: scsi 5:0:1:0: Direct-Access     ATTO     OffsiteTop00     0001 PQ: 0 ANSI: 5
>> > > > > > > Sep  5 10:39:06 kilchis kernel: sd 5:0:1:0: Attached scsi generic sg6 type 0
>> > > > > > > Sep  5 10:39:06 kilchis kernel: sd 5:0:1:0: [sdd] 488366336 4096-byte logical blocks: (2.00 TB/1.81 TiB)
>> > > > > > > Sep  5 10:39:06 kilchis kernel: sd 5:0:1:0: [sdd] Write Protect is off
>> > > > > > > Sep  5 10:39:06 kilchis kernel: sd 5:0:1:0: [sdd] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
>> > > > > > > Sep  5 10:39:06 kilchis kernel: sd 5:0:1:0: [sdd] 488366336 4096-byte logical blocks: (2.00 TB/1.81 TiB)
>> > > > > > > Sep  5 10:39:06 kilchis kernel: sdd: unknown partition table
>> > > > > > > Sep  5 10:39:06 kilchis kernel: sd 5:0:1:0: [sdd] 488366336 4096-byte logical blocks: (2.00 TB/1.81 TiB)
>> > > > > > > Sep  5 10:39:06 kilchis kernel: sd 5:0:1:0: [sdd] Attached SCSI disk
>> > > > > > > Sep  5 10:39:35 kilchis kernel: sd 5:0:1:0: [sdd] 488366336 4096-byte logical blocks: (2.00 TB/1.81 TiB)
>> > > > > > > Sep  5 10:39:35 kilchis kernel: sdd:
>> > > > > > > Sep  5 10:44:54 kilchis kernel: EXT4-fs (sdd): mounted filesystem with ordered data mode. Opts:
>> > > > > > > Sep  5 11:02:38 kilchis bacula-dir[4373]: bsock.c:537 Socket has errors=1 on call to client:10.20.10.21:9101
>> > > > > > > Sep  5 11:02:38 kilchis bacula-dir[4373]: bsock.c:537 Socket has errors=1 on call to client:10.20.10.21:9101
>> > > > > > > Sep  5 11:02:38 kilchis bacula-dir[4373]: bsock.c:537 Socket has errors=1 on call to client:10.20.10.21:9101
>> > > > > > > Sep  5 11:02:38 kilchis bacula-dir[4373]: bsock.c:537 Socket has errors=1 on call to client:10.20.10.21:9101
>> > > > > > > Sep  5 11:02:38 kilchis bacula-dir[4373]: bsock.c:537 Socket has errors=1 on call to client:10.20.10.21:9101
>> > > > > > > Sep  5 11:02:38 kilchis bacula-dir[4373]: bsock.c:537 Socket has errors=1 on call to client:10.20.10.21:9101
>> > > > > > > Sep  5 11:02:38 kilchis bacula-dir[4373]: bsock.c:537 Socket has errors=1 on call to client:10.20.10.21:9101
>> > > > > > > Sep  5 13:45:48 kilchis attocfgd: [CRIT] [ExpressSAS R608,50:01:08:60:00:57:3d:c0] [FW] RAID Group state now Offline: OffsiteMid
>> > > > > > > Sep  5 13:45:53 kilchis attocfgd: [CRIT] [ExpressSAS R608,50:01:08:60:00:57:3d:c0] [FW] RAID Group state now Offline: OffsiteTop
>> > > > > > > Sep  5 13:47:52 kilchis kernel: scsi 5:0:1:0: Direct-Access     ATTO     OffsiteMid00     0001 PQ: 0 ANSI: 5
>> > > > > > > Sep  5 13:47:52 kilchis kernel: sd 5:0:1:0: Attached scsi generic sg6 type 0
>> > > > > > > Sep  5 13:47:52 kilchis kernel: sd 5:0:1:0: [sde] 488366336 4096-byte logical blocks: (2.00 TB/1.81 TiB)
>> > > > > > > Sep  5 13:47:52 kilchis kernel: sd 5:0:1:0: [sde] Write Protect is off
>> > > > > > > Sep  5 13:47:52 kilchis kernel: sd 5:0:1:0: [sde] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
>> > > > > > > Sep  5 13:47:52 kilchis kernel: sd 5:0:1:0: [sde] 488366336 4096-byte logical blocks: (2.00 TB/1.81 TiB)
>> > > > > > > Sep  5 13:47:52 kilchis kernel: sde: unknown partition table
>> > > > > > > Sep  5 13:47:52 kilchis kernel: sd 5:0:1:0: [sde] 488366336 4096-byte logical blocks: (2.00 TB/1.81 TiB)
>> > > > > > > Sep  5 13:47:52 kilchis kernel: sd 5:0:1:0: [sde] Attached SCSI disk
>> > > > > > > Sep  5 13:48:01 kilchis kernel: EXT4-fs error (device sdd): __ext4_get_inode_loc: unable to read inode block - inode=2, block=1057
>> > > > > > > Sep  5 13:48:01 kilchis kernel: Buffer I/O error on device sdd, logical block 0
>> > > > > > > Sep  5 13:48:01 kilchis kernel: lost page write due to I/O error on sdd
>> > > > > > > Sep  5 13:48:01 kilchis kernel: EXT4-fs error (device sdd) in ext4_reserve_inode_write: IO failure
>> > > > > > > Sep  5 13:48:01 kilchis kernel: EXT4-fs (sdd): previous I/O error to superblock detected
>> > > > > > > Sep  5 13:48:01 kilchis kernel: Buffer I/O error on device sdd, logical block 0
>> > > > > > > Sep  5 13:48:01 kilchis kernel: lost page write due to I/O error on sdd
>> > > > > > > Sep  5 13:48:06 kilchis kernel: Aborting journal on device sdd-8.
>> > > > > > > Sep  5 13:48:06 kilchis kernel: Buffer I/O error on device sdd, logical block 243826688
>> > > > > > > Sep  5 13:48:06 kilchis kernel: lost page write due to I/O error on sdd
>> > > > > > > Sep  5 13:48:06 kilchis kernel: JBD2: I/O error detected when updating journal superblock for sdd-8.
>> > > > > > > Sep  5 13:48:08 kilchis kernel: EXT4-fs error (device sdd): ext4_put_super: Couldn't clean up the journal
>> > > > > > > Sep  5 13:48:08 kilchis kernel: EXT4-fs (sdd): Remounting filesystem read-only
>> > > > > > > Sep  5 13:48:44 kilchis kernel: sd 5:0:1:0: [sde] 488366336 4096-byte logical blocks: (2.00 TB/1.81 TiB)
>> > > > > > > Sep  5 13:48:44 kilchis kernel: sde:
>> > > > > > > Sep  5 13:54:05 kilchis kernel: EXT4-fs (sde): mounted filesystem with ordered data mode. Opts:

------------------------------------------------------------------------------
Check out the vibrant tech community on one of the world's most
engaging tech sites, Slashdot.org! http://sdm.link/slashdot

_______________________________________________
Bacula-users mailing list
Bacula-users@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/bacula-users
