Hi, First an introduction on what I'm trying to do, so you know what we're playing with :
Eric Bollengier and I have been discussing about saving space in the database, as ours is getting bigger and bigger (we're nearly at 300 million rows in file table). We're starting with the most obvious : try to make the records more compact on disc. There's a lot of space used by the indexes too, but I don't know yet if it would be possible to have something smaller (they are all used...) We're using PostgreSQL 8.2 on production now, so here are our stats. What we have now as space used is : - File is 108 GB for 275845760 records (size includes indexes) Space is used as follows : * Indexes : file_jpfid_idx -> 13 GB file_fp_idx -> 10GB file_jobid_idx -> 9GB file_pkey -> 8GB * Table itself (68GB): Administrative overhead (rows headers, block headers) block headers : 20 bytes per page (108GB equals 14 million pages) => 276 MB row header : 4 bytes in the block header + 27 bytes in PG 8.2 = 8.5 GB NB : by using PG 8.3, we can save 4 bytes per row -> "only 7.4 GB" Real data : SELECT attname, avg_width from pg_stats where tablename = 'file'; attname | avg_width ------------+----------- pathid | 4 filenameid | 4 lstat | 59 md5 | 24 fileid | 4 fileindex | 4 jobid | 4 markid | 4 everything is int4 except lstat and md5. So 4 bytes for every field, except an average of 59 (+4 bytes of overhead) for lstat and 24 (+4 bytes of overhead) for md5. A row is about 115 bytes, with 63 bytes for lstat and 28 for md5. So, something around 40GB of data. There is also 30% of free space (too much, I think, we have to investigate this part) Of the 40GB of data, So, I now get to the point : - we want to try to save some space in file table, so the obvious targets are lstat and md5. - we don't want to change bacula code, so the database has to do the dirty work... For now, I've been working on the lstat field. It is base64-encoded, and contains mostly spaces, As and Bs. What I've done is create a new datatype 'lstat', and code a huffman compress/decompress in C, with static weights for every symbol (' ',A,B vastly outweighting the rest of the codes). The savings are around 12% for the table (of course, we get nothing on the indexes as lstat isn't indexed). It costs CPU (about 1s of CPU for 300000 encoded + decoded lstats on my PC), but may be optimized as I'm not that good at C programming. I now want to try to save some more space on md5 (the gain will be smaller). md5 is base64 encoded too, but every char has the same weight, so all I can try is save some space by putting it back in raw form (it will be about 3/4 of the original size, a bit more as the variable size of the field will add the same overhead). The problem is that there doesn't seem to be an end of stream indicator, because the decoding part of bacula's code knows the size of what it wants to retrieve. What I need now is to know exactly what I can expect in the lstat so that I can encode and decode it correctly in the database backend. Is it 32 bytes aligned ? (so I can discard the last few bits I may decode). For instance, the current md5 is 22 chars, so it would decode 22*6=132 bytes. Are the last 4 discarded ? If I can't make such hypothesis, I will have to put a size header in my compressed field. Even 1 byte is 300MB now for our database and will be even worse in the future... Of course, if anyone is interested in the code, I will happily provide it as is, but I guess that Eric will commit it in the repository when he thinks my work is usable. For now, it only supports postgresql 8.3, and there is no automation for installing it... Anyway, if someone can remove my doubts about bacula's base64, I'd be very thankful. Cheers Marc ------------------------------------------------------------------------- This SF.net email is sponsored by: Microsoft Defy all challenges. Microsoft(R) Visual Studio 2008. http://clk.atdmt.com/MRT/go/vse0120000070mrt/direct/01/ _______________________________________________ Bacula-devel mailing list Bacula-devel@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/bacula-devel