Ted, I cc'd you. Could you please have a look at the save_records() function in the middle of my mail and tell us whether it's safe to use, at least on Ext4? I understand there might be a problem when using it on XFS, as XFS doesn't cover the rename case. Thanks.
Hi! It ate it about 13 days ago, on my ThinkPad T42:

shambhala:~> uprecords | cut -c1-66
     #               Uptime | System
----------------------------+-------------------------------------
     1    10 days, 21:01:41 | Linux 2.6.37-rc3-tp42     Fri Nov 26
     2     2 days, 02:09:03 | Linux 2.6.37-rc3-tp42     Wed Nov 24
     3     0 days, 13:59:05 | Linux 2.6.37-rc3-tp42     Tue Nov 23
     4     0 days, 06:40:23 | Linux 2.6.36-tp42-gtt-vr  Tue Nov 23
->   5     0 days, 02:04:05 | Linux 2.6.37-rc3-tp42
     6     0 days, 00:41:55 | Linux 2.6.37-rc3-tp42     Tue Nov 23
----------------------------+-------------------------------------
1up in     0 days, 04:36:19 | at    Tue Dec  7
no1 in    10 days, 18:57:37 | at    Sat Dec 18
    up    13 days, 22:36:12 | since Tue Nov 23
  down     0 days, 00:06:49 | since Tue Nov 23
   %up               99.966 | since Tue Nov 23

I don't remember what might have happened at that time. It's not the first time, either: I already restored it from a backup in October:

shambhala:~> ls -l /var/spool/uptimed
total 28
-rw-r--r-- 1 daemon daemon   11 Dec  7 10:50 bootid
-rw-r--r-- 1 root   root    254 Dec  7 12:35 records
-rw-r--r-- 1 daemon daemon 9806 Mar  3  2010 records-2010-03-03-aus-dem-rsync-backup
-rw-r--r-- 1 daemon daemon 1450 Mar  9  2010 records-2010-03-09-unvollstaendig
-rw-r--r-- 1 daemon daemon  254 Dec  7 12:30 records.old

As you can see, the last working backup here is 9806 bytes, way bigger than the current file. This is on

shambhala:~> df -hT /var/spool/uptimed
Filesystem                   Type  Size  Used Avail Use% Mounted on
/dev/mapper/shambhala-debian ext4   20G   14G  5,5G  72% /

with a quite recent kernel (2.6.36 / 2.6.37-rc3), which has the Ext4 safeguard for the rename and truncate cases introduced in 2.6.30, I believe: written data is flushed *before* the file is renamed. But according to libuptimed/urec.c:

void save_records(int max, time_t log_threshold) {
	FILE *f;
	Urec *u;
	int i = 0;

	f = fopen(FILE_RECORDS".tmp", "w");
	if (!f) {
		printf("uptimed: cannot write to %s\n", FILE_RECORDS);
		return;
	}

	for (u = urec_list; u; u = u->next) {
		/* Ignore everything below the threshold */
		if (u->utime >= log_threshold) {
			fprintf(f, "%lu:%lu:%s\n", (unsigned long)u->utime,
			        (unsigned long)u->btime, u->sys);
			/* Stop processing when we've logged the max number specified. */
			if ((max > 0) && (++i >= max)) break;
		}
	}
	fclose(f);
	rename(FILE_RECORDS, FILE_RECORDS".old");
	rename(FILE_RECORDS".tmp", FILE_RECORDS);
}

uptimed does use the rename pattern. Thus I do not get *why* it ate my old records again. Nonetheless, I think there should be a safeguard, like falling back to the old file if the current one is empty. I would also keep more than one backup, given the small size of this file. Maybe logrotate can do this while keeping the original file instead of truncating it.

I have the following configuration:

shambhala:~> cat /etc/uptimed.conf
# Uptimed configuration file.

# Interval to write the logfile with in seconds.
UPDATE_INTERVAL=300

# Maximum number of entries in logfile. Set to 0 for unlimited.
LOG_MAXIMUM_ENTRIES=0

# Minimum uptime that must be reached for it to be considered a record.
LOG_MINIMUM_UPTIMED=1h
[...]

An option to fsync() would be fine, so that people here can easily test whether fsync helps in that case.
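To make that easy to try, here is a minimal sketch of what such an fsync() step could look like. It is not a patch against the actual uptimed sources: save_records_fsync() is a hypothetical name, and FILE_RECORDS, Urec and urec_list are assumed to be the same declarations as in the snippet quoted above. The idea is simply to flush and fsync() the temporary file, and to leave the existing records untouched if that fails, before rename() makes the new file the only copy.

/*
 * Minimal sketch (not a real patch): an fsync-before-rename variant of
 * save_records(). FILE_RECORDS, Urec and urec_list are assumed to come
 * from uptimed's headers, as in the snippet quoted above.
 */
#include <stdio.h>
#include <time.h>
#include <unistd.h>   /* fsync() */

void save_records_fsync(int max, time_t log_threshold)
{
	FILE *f;
	Urec *u;
	int i = 0;

	f = fopen(FILE_RECORDS".tmp", "w");
	if (!f) {
		printf("uptimed: cannot write to %s\n", FILE_RECORDS);
		return;
	}

	for (u = urec_list; u; u = u->next) {
		/* Ignore everything below the threshold. */
		if (u->utime >= log_threshold) {
			fprintf(f, "%lu:%lu:%s\n", (unsigned long)u->utime,
			        (unsigned long)u->btime, u->sys);
			if ((max > 0) && (++i >= max))
				break;
		}
	}

	/*
	 * Push the data out of the stdio buffer and ask the kernel to write
	 * it to disk *before* the temporary file replaces the real one. If
	 * any step fails, keep the old records file untouched.
	 */
	if (fflush(f) != 0 || fsync(fileno(f)) != 0) {
		printf("uptimed: cannot sync %s.tmp\n", FILE_RECORDS);
		fclose(f);
		return;
	}
	if (fclose(f) != 0) {
		printf("uptimed: cannot close %s.tmp\n", FILE_RECORDS);
		return;
	}

	rename(FILE_RECORDS, FILE_RECORDS".old");
	rename(FILE_RECORDS".tmp", FILE_RECORDS);
}

That way the data is known to be on disk before the rename, and a failed write can no longer replace a good records file with an empty one.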
Then there is the slight chance that uptimed gets confused during runtime and writes out an empty records file by accident, but I find this highly unlikely.

I will restore as much as possible from my backup. It is easy to combine the contents of a backup and a new records file.

I also lost the records on a Lenny => Squeeze update on my Dell workstation at work. So this is three losses within just a few months. In its current state, uptimed is hardly usable for me. For now I have set up a backup for myself as fcrontab jobs:

# Backup of the uptimed database
@ 1d  cp -p /var/spool/uptimed/records ~/Backup/uptimed/records-$(date +%Y-%m-%d)
@ 30d find ~/Backup/uptimed/ -name "records-*" -and -mtime +30 -delete

Something like that should go into uptimed, or into a cron job that comes with the package. It could be a cron.daily or at least a cron.weekly job (using some directory in /var for backups).

So, I hope this was enough constructive feedback to show what can be done about it. I can craft up a cron job for the uptimed package that does the backup, if you want. I am not that much into C programming currently, but eventually I could come up with a patch for uptimed as well. But I think this bug needs acknowledgment as being serious, because data loss is involved. Just denying that there is a problem doesn't help us proceed further. A user of uptimed IMHO rightly does not care whether it's a problem in the kernel, the filesystem, or the userspace program.

Ciao,
-- 
Martin 'Helios' Steigerwald - http://www.Lichtvoll.de
GPG: 03B0 0D6C 0040 0710 4AFA B82F 991B EAAC A599 84C7