Hi all,

As part of the ongoing process of trying to trim the catalog a little and make it easier to query, I've been talking to a few folks who've pointed out that much of the `lstat' field in the catalog is actually unnecessary.
Specifically, it's only needed to select files for recovery, not to actually restore them. Things like the uid/gid, mtime and permissions, for example, are really only there to help the user select the file(s) they want.

So, I'd like to find out what file attributes Bacula actually *NEEDS* to track in order to operate, and furthermore what attributes people *WANT* it to track to make it easier to select files.

Verification is also an issue - must all attributes be verified, even fairly pointless ones like atime and ctime? What about things like the inode number, that don't even make sense to verify?

Are there any NEW attributes it'd be useful to have for other platforms, like owning SID (or owning user "DOMAIN\name", maybe) for Windows?

----------------------------------------
[A] AFAIK all it _needs_ for restore is:
----------------------------------------

- File name
- [st_size] File size in bytes (for restore verification)
- md5sum

-----------------------------------------------------------
[B] Things to keep to help users identify files to restore:
-----------------------------------------------------------

- [st_uid] User ID (not user name)
- [st_gid] Group ID (not group name)
- [st_mtime] Last-modified time

----------------------------------------
[C] Things we can and should get rid of:
----------------------------------------

- [st_dev] ID of device containing file
- [st_atime] Last accessed time
- [st_ctime] Time of last status/inode change
- [st_ino] Inode number
- [st_rdev] Device ID (if special file)
- [st_blksize] System I/O block size
- [st_blocks] Number of 512B (not "system block size"?) blocks

They're all in the volume files, and aren't all that interesting for purposes of selecting/identifying files. They don't need to be in the catalog.

Note that `ctime' is *NOT* the time the file was created. It's the time the inode was last changed.
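
To make the split in [A] to [C] concrete, here's a rough Python sketch (not Bacula code) that takes an lstat() result and keeps only the fields proposed for the catalog. The helper name `catalog_attrs` and the KEEP/DROP tuples are made up for illustration; the md5sum is left out because it comes from the file's content, not from lstat().

```python
import os
import tempfile

# lstat fields proposed to stay in the catalog ([A] and [B] above).
KEEP = ("st_size", "st_uid", "st_gid", "st_mtime")

# lstat fields proposed for removal from the catalog ([C] above).
# They would still be recorded in the volume files.
DROP = ("st_dev", "st_atime", "st_ctime", "st_ino",
        "st_rdev", "st_blksize", "st_blocks")


def catalog_attrs(path):
    """Return only the lstat fields the trimmed catalog would store.

    Hypothetical helper for illustration; not part of Bacula.
    """
    st = os.lstat(path)
    return {name: getattr(st, name) for name in KEEP}


if __name__ == "__main__":
    # Write a small scratch file and show what would be catalogued for it.
    with tempfile.NamedTemporaryFile() as tmp:
        tmp.write(b"hello")
        tmp.flush()
        print(catalog_attrs(tmp.name))
```

The fields in [D] (st_nlink, st_mode) are deliberately in neither tuple, since they're still undecided.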

--------------------------
[D] Ones I'm uncertain of:
--------------------------

- [st_nlink] Number of hardlinks

  ... which is useless, except if you want to try to restore hardlinks, in which case it might be a handy check value. On one hand, it's in the volume file. On the other hand, if we do something that'll make database designers cry and store NULL for values of 1, it takes up practically no space.

- [st_mode] Access mode (file system permissions)

  It's already in the volumes, so it's available for proper restoration. Dropping mode does mean you can't verify restored mode against the mode in the catalog, so mode *might* want to be kept anyway. Personally I don't think it's worth having, since it doesn't add much when selecting files.

--------------------------------------
[E] Handy extras not currently tracked
--------------------------------------

- Owning user _name_
- Owning group _name_

  These can be tracked without storing them individually per-file, by maintaining a per-job uid->user, gid->group mapping in a separate "idmap" table. I'm not sure it's worth it, when it's so easy to just look up the uid/gid on the host you're interested in, but maybe it's worthwhile. Anybody?

- Owning user SID

  Windows only. NULL for other platforms. Do we care about tracking this in the catalog? Will anybody actually use it, or should we just track user names instead? Might want to be able to associate it with "DOMAIN\name" info for ownership name display for Windows backups, though.

- File creation time

  This is not available on UNIX, so it'd only be there for Windows and Mac OS X (HFS+) clients. Data from UNIX FDs would just store NULL here. Is it really useful enough to justify tracking this?
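
The per-job "idmap" idea in [E] might look roughly like the sketch below. This is only an illustration: the table layout, the column names (`kind`, `id`, `name`), the `owners` helper and the sample rows are all assumptions, and SQLite is used purely because it ships with Python. A real catalog would carry this in whatever database Bacula is configured for.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical per-job id -> name mapping; one row per distinct
    -- uid or gid seen in a job, instead of one name per file.
    CREATE TABLE idmap (
        jobid integer NOT NULL,
        kind  text    NOT NULL,   -- 'u' for uid, 'g' for gid
        id    integer NOT NULL,
        name  text    NOT NULL,
        PRIMARY KEY (jobid, kind, id)
    );
    -- Cut-down stand-in for the file table, just enough to join on.
    CREATE TABLE file (
        jobid integer, filename text, st_uid integer, st_gid integer
    );
""")

# Made-up sample data.
conn.executemany("INSERT INTO idmap VALUES (?, ?, ?, ?)",
                 [(1, "u", 1000, "alice"), (1, "g", 100, "users")])
conn.executemany("INSERT INTO file VALUES (?, ?, ?, ?)",
                 [(1, "/home/alice/notes.txt", 1000, 100),
                  (1, "/etc/passwd", 0, 0)])


def owners(conn, jobid):
    """List (filename, user name, group name) for one job.

    Names come back as None when the job has no mapping for that id.
    """
    return conn.execute("""
        SELECT f.filename, u.name, g.name
        FROM file f
        LEFT JOIN idmap u
               ON u.jobid = f.jobid AND u.kind = 'u' AND u.id = f.st_uid
        LEFT JOIN idmap g
               ON g.jobid = f.jobid AND g.kind = 'g' AND g.id = f.st_gid
        WHERE f.jobid = ?
        ORDER BY f.filename
    """, (jobid,)).fetchall()


if __name__ == "__main__":
    for row in owners(conn, 1):
        print(row)
```

The LEFT JOINs mean a file whose uid/gid has no mapping still shows up in the listing, just without a name.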

------------------------
PROPOSED NEW FILE SCHEMA
------------------------

Including only things in sections [A] and [B] above, the `file' schema would become rather leaner:

    CREATE TABLE file2 (
        fileid     bigint,
        fileindex  integer,
        jobid      integer,
        pathid     integer,
        filenameid integer,
        markid     integer,
        mtime      timestamp,
        st_uid     integer,
        st_gid     integer,
        st_size    bigint,
        linkfi     integer,
        checksum   text
    );

With PostgreSQL, the above schema shrinks the catalog by 4%. I expect better results from other databases that don't store text fields as efficiently.

In the above, mtime is defined as a database `timestamp' rather than stored as the raw integer value from the source host. The upside is that it's easy to query and index, and it has the same meaning no matter what the origin of the file.

The downside is that the database may not store timestamps with the same precision as the FD host, and may store them in an internal floating-point form, so there may be conversion error. For example, PostgreSQL uses floating-point-based timestamps by default, and they're not as precise as (e.g.) NTFS's 64-bit timestamps. Verify must allow for some error.

The alternative is storing a `bigint' (64-bit integer) which has a meaning depending on the timestamp precision and epoch of the origin system. Not, IMO, that useful.

The `checksum' (was: "md5") field MUST be NULL if unset/unknown/unknowable (such as for a directory or device node), NOT zero. This saves space on many DBMSs, and is semantically cleaner.

It'd be interesting to look at storing `checksum' as a bytea/blob field instead of base64-encoding it, but the gains probably wouldn't be too huge, converting DBs from Bacula's wacky almost-base64 would be "interesting", and the FD sends base64-encoded data currently so there'd be compatibility issues to deal with.
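
For a sense of scale on the bytea/blob question, the sketch below compares three encodings of one MD5 digest in Python. It uses standard (RFC 4648) base64, which is an assumption for illustration only: it will NOT decode the values Bacula actually writes, and the function names are made up.

```python
import base64
import hashlib


def md5_encodings(data):
    """Return one MD5 digest as raw bytes, hex text and base64 text.

    Illustration only: standard base64, not Bacula's own variant.
    """
    raw = hashlib.md5(data).digest()
    return {
        "raw": raw,                                       # what bytea would hold
        "hex": raw.hex(),                                 # conventional text form
        "base64": base64.b64encode(raw).decode("ascii"),  # padded base64 text
    }


def base64_to_hex(text):
    """Convert a standard-base64 digest to hex for use by external apps."""
    return base64.b64decode(text).hex()


if __name__ == "__main__":
    enc = md5_encodings(b"hello")
    for form, value in enc.items():
        print(form, len(value), value)
```

A raw MD5 is 16 bytes, against 24 characters in padded standard base64 and 32 in hex, so the per-row saving from a binary column is small. Whether that pays for a catalog conversion is the real question.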
I don't think switching `checksum' to a binary field is worth it.

That does mean that the md5 stored in the catalog is USELESS to apps that aren't Bacula, since it's not stored as a binary value or a conventional hex-format MD5. A pity, since a binary-format MD5 would be easy to convert to hex on demand for external app use, but probably not that big a deal.

I've renamed it "checksum" in the above to reflect the fact that it's not necessarily MD5, and it's not going to be in a form most apps will understand either.

--
Craig Ringer

_______________________________________________
Bacula-users mailing list
Bacula-users@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/bacula-users