Hi folks I've been having a lot of problems with Bacula's disk volume management over time. Most issues seem to stem from cases where the catalog gets out of sync with the file system, like:
- Volume inserted into catalog, creation of file on disk fails, volume remains in catalog - Write to volume fails, resulting in volume that's shorter on disk than the catalog thinks it should be because the catalog is updated with the *expected* size assuming the write is successful. Further backups try to use this volume and fail. Restores from this volume fail. ARgh! The latest issue is another catalog<>disk discrepency. At some point Bacula seems to have failed to create some auto-labeled volumes, but has still inserted them into the catalog. This might've been due to a transient disk-full situation, but it's truly hard to know. The catalog shows some volumes that don't actually exist on disk: > +---------+------------------------+-----------+---------+---------------+----------+--------------+---------+------+-----------+--------------------+---------------------+ > | mediaid | volumename | volstatus | enabled | volbytes | > volfiles | volretention | recycle | slot | inchanger | mediatype | > lastwritten | > +---------+------------------------+-----------+---------+---------------+----------+--------------+---------+------+-----------+--------------------+---------------------+ > | 158 | HotProductionIncr-0158 | Recycle | 1 | 1 | > 0 | 1,209,600 | 1 | 0 | 0 | File_HotProduction | > 2010-04-21 17:46:51 | > | 165 | HotProductionIncr-0165 | Used | 1 | 2,129,594,337 | > 0 | 1,209,600 | 1 | 0 | 0 | File_HotProduction | > 2010-04-22 14:00:22 | > | 172 | HotProductionIncr-0172 | Used | 1 | 486,771,914 | > 0 | 1,209,600 | 1 | 0 | 0 | File_HotProduction | > 2010-04-23 13:00:31 | > | 178 | HotProductionIncr-0178 | Used | 1 | 2,698,541 | > 0 | 1,209,600 | 1 | 0 | 0 | File_HotProduction | > 2010-04-24 13:00:19 | > | 186 | HotProductionIncr-0186 | Used | 1 | 622,259 | > 0 | 1,209,600 | 1 | 0 | 0 | File_HotProduction | > 2010-04-24 23:05:16 | > | 196 | HotProductionIncr-0196 | Used | 1 | 1,240,182 | > 0 | 1,209,600 | 1 | 0 | 0 | File_HotProduction | > 2010-04-26 23:05:09 | > | 202 | HotProductionIncr-0202 | Used | 1 | 1,634,621,778 | > 0 | 1,209,600 | 1 | 0 | 0 | File_HotProduction | > 2010-04-28 13:00:21 | > | 203 | HotProductionIncr-0203 | Used | 1 | 3,037,665,985 | > 0 | 1,209,600 | 1 | 0 | 0 | File_HotProduction | > 2010-04-29 14:00:23 | > | 204 | HotProductionIncr-0204 | Used | 1 | 695,723,567 | > 0 | 1,209,600 | 1 | 0 | 0 | File_HotProduction | > 2010-04-30 13:00:55 | > | 205 | HotProductionIncr-0205 | Used | 1 | 8,473,730,759 | > 1 | 1,209,600 | 1 | 0 | 0 | File_HotProduction | > 2010-05-06 14:01:11 | > | 206 | HotProductionIncr-0206 | Used | 1 | 4,679,398,060 | > 1 | 1,209,600 | 1 | 0 | 0 | File_HotProduction | > 2010-05-07 13:00:55 | > | 207 | HotProductionIncr-0207 | Used | 1 | 107,414,349 | > 0 | 1,209,600 | 1 | 0 | 0 | File_HotProduction | > 2010-05-08 23:05:05 | > | 208 | HotProductionIncr-0208 | Used | 1 | 519,455 | > 0 | 1,209,600 | 1 | 0 | 0 | File_HotProduction | > 2010-05-08 23:05:48 | > | 209 | HotProductionIncr-0209 | Used | 1 | 711,016,317 | > 0 | 1,209,600 | 1 | 0 | 0 | File_HotProduction | > 2010-05-10 23:06:23 | > | 210 | HotProductionIncr-0210 | Used | 1 | 1,723,979,638 | > 0 | 1,209,600 | 1 | 0 | 0 | File_HotProduction | > 2010-05-12 10:01:34 | > | 211 | HotProductionIncr-0211 | Used | 1 | 5,953,774,081 | > 1 | 1,209,600 | 1 | 0 | 0 | File_HotProduction | > 2010-05-13 12:02:58 | > | 212 | HotProductionIncr-0212 | Append | 1 | 0 | > 0 | 1,209,600 | 1 | 0 | 0 | File_HotProduction | > | > | 213 | HotProductionIncr-0213 | Append | 1 | 0 | > 0 | 1,209,600 | 1 | 0 | 0 | File_HotProduction | > | > | 214 | HotProductionIncr-0214 | Append | 1 | 0 | > 0 | 1,209,600 | 1 | 0 | 0 | File_HotProduction | > | > | 215 | HotProductionIncr-0215 | Append | 1 | 0 | > 0 | 1,209,600 | 1 | 0 | 0 | File_HotProduction | > | > | 216 | HotProductionIncr-0216 | Append | 1 | 0 | > 0 | 1,209,600 | 1 | 0 | 0 | File_HotProduction | > | > | 217 | HotProductionIncr-0217 | Append | 1 | 0 | > 0 | 1,209,600 | 1 | 0 | 0 | File_HotProduction | > | > | 218 | HotProductionIncr-0218 | Append | 1 | 0 | > 0 | 1,209,600 | 1 | 0 | 0 | File_HotProduction | > | > | 219 | HotProductionIncr-0219 | Append | 1 | 0 | > 0 | 1,209,600 | 1 | 0 | 0 | File_HotProduction | > | > | 220 | HotProductionIncr-0220 | Append | 1 | 0 | > 0 | 1,209,600 | 1 | 0 | 0 | File_HotProduction | > | > +---------+------------------------+-----------+---------+---------------+----------+--------------+---------+------+-----------+--------------------+---------------------+ where any with "lastwritten" blank, volbytes 0, volfiles 0 are missing in the file system: > cr...@backup:$ sudo ls -l /srv/bacula-main/HotProduction > total 603766036 > -rw-r----- 1 bacula tape 168740858003 2010-04-20 17:37 HotProductionFull-0154 > -rw-r----- 1 bacula tape 166859921967 2010-04-26 03:13 HotProductionFull-0191 > -rw-r----- 1 bacula tape 42450993346 2010-05-05 19:18 HotProductionFull-0199 > -rw-r----- 1 bacula tape 206717382464 2010-05-10 03:38 HotProductionFull-0250 > -rw-r----- 1 bacula tape 3848435102 2010-04-21 17:46 HotProductionIncr-0158 > -rw-r----- 1 bacula tape 2129594337 2010-04-22 14:00 HotProductionIncr-0165 > -rw-r----- 1 bacula tape 486771914 2010-04-23 13:00 HotProductionIncr-0172 > -rw-r----- 1 bacula tape 2698541 2010-04-24 13:00 HotProductionIncr-0178 > -rw-r----- 1 bacula tape 622259 2010-04-24 23:05 HotProductionIncr-0186 > -rw-r----- 1 bacula tape 1240182 2010-04-26 23:05 HotProductionIncr-0196 > -rw-r----- 1 bacula tape 1634621778 2010-04-28 13:00 HotProductionIncr-0202 > -rw-r----- 1 bacula tape 3037665985 2010-04-29 14:00 HotProductionIncr-0203 > -rw-r----- 1 bacula tape 695723567 2010-04-30 13:00 HotProductionIncr-0204 > -rw-r----- 1 bacula tape 8473730759 2010-05-06 14:01 HotProductionIncr-0205 > -rw-r----- 1 bacula tape 4679398060 2010-05-07 13:00 HotProductionIncr-0206 > -rw-r----- 1 bacula tape 107414349 2010-05-08 23:05 HotProductionIncr-0207 > -rw-r----- 1 bacula tape 519455 2010-05-08 23:05 HotProductionIncr-0208 > -rw-r----- 1 bacula tape 711016317 2010-05-10 23:06 HotProductionIncr-0209 > -rw-r----- 1 bacula tape 1723979638 2010-05-12 10:01 HotProductionIncr-0210 > -rw-r----- 1 bacula tape 5953774081 2010-05-13 12:02 HotProductionIncr-0211 and when a job tries to use them, it seems to get stuck, with the sd periodically reporting the error: 18-May 12:20 backup-sd JobId 854: Warning: mount.c:221 Open device "FileStorage_HotProduction" (/srv/bacula-main/HotProduction) Volume "HotProductionIncr-0212" failed: ERR=dev.c:549 Could not open: /srv/bacula-main/HotProduction/HotProductionIncr-0212, ERR=No such file or directory leaving a huge queue of jobs that use the stame storage stuck behind it: > 853 Increme BucketActiveData.2010-05-13_13.00.00_05 is running > 854 Increme FlouryProductionAndPongrassData.2010-05-13_13.00.00_06 is > waiting for a mount request > 855 Increme PascalProductionAndPongrassData.2010-05-13_13.00.00_07 is > running > 856 Increme BucketActiveData.2010-05-13_14.00.00_08 is waiting on max > Client jobs > 857 Increme FlouryProductionAndPongrassData.2010-05-13_14.00.00_09 is > waiting on max Client jobs > 858 Increme PascalProductionAndPongrassData.2010-05-13_14.00.00_10 is > waiting on max Client jobs > 861 Increme BucketActiveData.2010-05-13_23.05.00_13 is waiting on max > Client jobs > 862 Increme FlouryProductionAndPongrassData.2010-05-13_23.05.00_14 is > waiting on max Client jobs > 863 Increme PascalProductionAndPongrassData.2010-05-13_23.05.00_15 is > waiting on max Client jobs > 867 Increme BucketArchivalData.2010-05-13_23.11.00_19 is waiting on max > Client jobs > 871 Increme BucketSystem.2010-05-13_23.15.00_23 is waiting on max Client > jobs > 873 Increme FlourySystem.2010-05-13_23.15.00_25 is waiting on max Client > jobs > 875 Full BackupCatalog.2010-05-13_23.20.00_27 is waiting for higher > priority jobs to finish > 876 Increme BucketActiveData.2010-05-14_13.00.00_28 is waiting on max > Client jobs > 877 Increme FlouryProductionAndPongrassData.2010-05-14_13.00.00_29 is > waiting on max Client jobs > 878 Increme PascalProductionAndPongrassData.2010-05-14_13.00.00_30 is > waiting on max Client jobs > 881 Increme BucketActiveData.2010-05-14_23.05.00_33 is waiting on max > Client jobs > 882 Increme FlouryProductionAndPongrassData.2010-05-14_23.05.00_34 is > waiting on max Client jobs > 883 Increme PascalProductionAndPongrassData.2010-05-14_23.05.00_35 is > waiting on max Client jobs > 887 Increme BucketArchivalData.2010-05-14_23.11.00_39 is waiting on max > Client jobs > 891 Increme BucketSystem.2010-05-14_23.15.00_43 is waiting on max Client > jobs > 893 Increme FlourySystem.2010-05-14_23.15.00_45 is waiting on max Client > jobs > 895 Full BackupCatalog.2010-05-14_23.20.00_47 is waiting execution > 896 VolumeT VerifyBucketArchivalData.2010-05-15_12.00.00_48 is waiting on > max Client jobs > 898 Increme BucketActiveData.2010-05-15_13.00.00_50 is waiting on max > Client jobs > 899 Increme FlouryProductionAndPongrassData.2010-05-15_13.00.00_51 is > waiting on max Client jobs > 900 Increme PascalProductionAndPongrassData.2010-05-15_13.00.00_52 is > waiting on max Client jobs > 903 Increme BucketActiveData.2010-05-15_23.05.00_55 is waiting on max > Client jobs > 904 Increme FlouryProductionAndPongrassData.2010-05-15_23.05.00_56 is > waiting on max Client jobs > 905 Increme PascalProductionAndPongrassData.2010-05-15_23.05.00_57 is > waiting on max Client jobs > 909 Increme BucketArchivalData.2010-05-15_23.11.00_01 is waiting on max > Client jobs > 913 Increme BucketSystem.2010-05-15_23.15.00_05 is waiting on max Client > jobs > 915 Increme FlourySystem.2010-05-15_23.15.00_07 is waiting on max Client > jobs > 917 Full BackupCatalog.2010-05-15_23.20.00_09 is waiting execution > 918 Full BucketActiveData.2010-05-16_23.05.00_11 is waiting on max > Client jobs > 919 Full FlouryProductionAndPongrassData.2010-05-16_23.05.00_12 is > waiting on max Client jobs > 920 Full PascalProductionAndPongrassData.2010-05-16_23.05.00_13 is > waiting on max Client jobs > 927 Differe BucketSystem.2010-05-16_23.15.00_20 is waiting on max Client > jobs > 929 Differe FlourySystem.2010-05-16_23.15.00_22 is waiting on max Client > jobs > 931 Full BackupCatalog.2010-05-16_23.20.00_24 is waiting execution > 932 Increme BucketActiveData.2010-05-17_13.00.00_25 is waiting on max > Client jobs > 933 Increme FlouryProductionAndPongrassData.2010-05-17_13.00.00_26 is > waiting on max Client jobs > 934 Increme PascalProductionAndPongrassData.2010-05-17_13.00.00_27 is > waiting on max Client jobs > 937 Increme BucketActiveData.2010-05-17_23.05.00_30 is waiting on max > Client jobs > 938 Increme FlouryProductionAndPongrassData.2010-05-17_23.05.00_31 is > waiting on max Client jobs > 939 Increme PascalProductionAndPongrassData.2010-05-17_23.05.00_32 is > waiting on max Client jobs > 943 Increme BucketArchivalData.2010-05-17_23.11.00_37 is waiting on max > Client jobs > 947 Increme BucketSystem.2010-05-17_23.15.00_41 is waiting on max Client > jobs > 949 Increme FlourySystem.2010-05-17_23.15.00_43 is waiting on max Client > jobs > 951 Full BackupCatalog.2010-05-17_23.20.00_45 is waiting execution I'm not sure I understand the point of the SD's behaviour here. Shouldn't it fail with an error if it can't find the volume? Or (IMO even better) simply create it? For that matter, shouldn't volume creation be done during an open transaction with PostgreSQL, so that the volume record isn't actually committed into the database until the file system operations to label it have succeeded? -- Craig Ringer Tech-related writing: http://soapyfrogs.blogspot.com/ ------------------------------------------------------------------------------ _______________________________________________ Bacula-users mailing list Bacula-users@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/bacula-users