Cousin Marc wrote: > Hi, > > First an introduction on what I'm trying to do, so you know what we're > playing > with : > > Eric Bollengier and I have been discussing about saving space in the > database, > as ours is getting bigger and bigger (we're nearly at 300 million rows in > file table). > > We're starting with the most obvious : try to make the records more compact > on > disc. There's a lot of space used by the indexes too, but I don't know yet > if it would be possible to have something smaller (they are all used...) > > We're using PostgreSQL 8.2 on production now, so here are our stats. > > What we have now as space used is : > - File is 108 GB for 275845760 records (size includes indexes) > Space is used as follows : > * Indexes : > file_jpfid_idx -> 13 GB > file_fp_idx -> 10GB > file_jobid_idx -> 9GB > file_pkey -> 8GB > * Table itself (68GB): > Administrative overhead (rows headers, block headers) > block headers : 20 bytes per page (108GB equals 14 million pages) => 276 MB > row header : 4 bytes in the block header + 27 bytes in PG 8.2 = 8.5 GB > NB : by using PG 8.3, we can save 4 bytes per row -> "only 7.4 GB" > > Real data : > SELECT attname, avg_width from pg_stats where tablename = 'file'; > attname | avg_width > ------------+----------- > pathid | 4 > filenameid | 4 > lstat | 59 > md5 | 24 > fileid | 4 > fileindex | 4 > jobid | 4 > markid | 4 > > everything is int4 except lstat and md5. > So 4 bytes for every field, except an average of 59 (+4 bytes of overhead) > for > lstat and 24 (+4 bytes of overhead) for md5. A row is about 115 bytes, with > 63 bytes for lstat and 28 for md5. > > So, something around 40GB of data. There is also 30% of free space (too much, > I think, we have to investigate this part) > > Of the 40GB of data, > > > So, I now get to the point : > - we want to try to save some space in file table, so the obvious targets are > lstat and md5. > - we don't want to change bacula code, so the database has to do the dirty > work... > For now, I've been working on the lstat field. It is base64-encoded, and > contains mostly spaces, As and Bs. > > What I've done is create a new datatype 'lstat', and code a huffman > compress/decompress in C, with static weights for every symbol (' ',A,B > vastly outweighting the rest of the codes). The savings are around 12% for > the table (of course, we get nothing on the indexes as lstat isn't indexed). > It costs CPU (about 1s of CPU for 300000 encoded + decoded lstats on my PC), > but may be optimized as I'm not that good at C programming. > > > I now want to try to save some more space on md5 (the gain will be smaller). > md5 is base64 encoded too, but every char has the same weight, so all I can > try is save some space by putting it back in raw form (it will be about 3/4 > of the original size, a bit more as the variable size of the field will add > the same overhead). The problem is that there doesn't seem to be an end of > stream indicator, because the decoding part of bacula's code knows the size > of what it wants to retrieve. > > What I need now is to know exactly what I can expect in the lstat so that I > can encode and decode it correctly in the database backend. Is it 32 bytes > aligned ? (so I can discard the last few bits I may decode). > For instance, the current md5 is 22 chars, so it would decode 22*6=132 bytes. > Are the last 4 discarded ? > If I can't make such hypothesis, I will have to put a size header in my > compressed field. Even 1 byte is 300MB now for our database and will be even > worse in the future... > > Of course, if anyone is interested in the code, I will happily provide it as > is, but I guess that Eric will commit it in the repository when he thinks my > work is usable. For now, it only supports postgresql 8.3, and there is no > automation for installing it...
Please post it. I think seeing it, knowing how it is used, and how it works will be useful in the evaluation of the patch. Well, it's not a Bacula patch, it's a patch against the database. > Anyway, if someone can remove my doubts about bacula's base64, I'd be very > thankful. -- Dan Langille - http://www.langille.org/ BSDCan - The Technical BSD Conference: http://www.bsdcan.org/ PGCon - The PostgreSQL Conference: http://www.pgcon.org/ ------------------------------------------------------------------------- This SF.net email is sponsored by: Microsoft Defy all challenges. Microsoft(R) Visual Studio 2008. http://clk.atdmt.com/MRT/go/vse0120000070mrt/direct/01/ _______________________________________________ Bacula-devel mailing list Bacula-devel@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/bacula-devel