On Monday 04 April 2005 18:35, Matt White wrote:
> Kern Sibbald wrote:
> > Well, there are two obvious ways of speeding up inserting of attributes.
> >
> > 1. Cache the attributes, combine them with the MD5/SHA1 signature that
> > follows each attribute, and then do a single insert rather than an
> > insert of the attribute followed by an update for the MD5/SHA1
> > signature. This would benefit all DB versions (SQLite, MySQL and
> > PostgreSQL). Note that the user can turn off MD5 signatures, so they
> > may not always exist. They are also not generated for directories and
> > other special files.
>
> I'll take a shot at that...
>
> I've had a look at the code, and have a couple of questions before I
> start:
>
> It looks like I'll need to cache the file index, job id, path id,
> filename id, and attributes in db_create_file_attributes_record(), then
> pull them out of the cache in db_add_SIG_to_file_record() and do the
> actual insert (or just do both parts in catalog_update() and
> get_attributes_and_put_in_catalog()). It looks like FileIndex is what I
> need to use as the key. As each record gets inserted, it needs to come
> out of the cache (or at least be marked as used), and then at the end,
> everything still in the cache can be inserted with a 0 signature to take
> care of files/dirs with no signature... does that sound like about the
> right process, or am I suffering from Monday morning delusions?
>
> Each cached record is going to be 16 bytes plus the size of the
> attributes, so I don't think we can cache them in memory :-) Any
> objections to a dbm-based cache? It would also allow recovery of the
> attributes if something dies before they get inserted...
You are close, but it is probably a lot easier than you think (once you
know more of the details). In src/dird/catreq.c, you need only cache one
copy of the ATTR_DBR "ar" record that is generated in catalog_update()
before it is sent to db_create_file_attributes_record(). Then, in the code
that checks whether the signature belongs to the last ar record, you simply
replace the call to db_add_SIG_to_file_record() with a call to
db_create_file_attributes_record().

There are a few more details:

1. You must expand the attributes record to include the signature -- see
   cats.h (just copy char SIG[50] from the FILE_DBR).
2. You must stuff the signature into the record.
3. You must have a flag that indicates that the record has a signature.
4. You must modify db_create_file_record() in src/cats to use the value of
   the signature in the ar packet, rather than 0, when it does the insert.
5. Then, instead of doing the db_add_SIG..., you call
   db_create_file_attributes_record().
6. Finally, and very importantly, you must have some way to flush out any
   last cached attribute record that is not followed by a signature at the
   end of the job.

The same change could be made in src/dird/fd_cmds.c -- that code is used
only for Verify jobs, whereas the code in catreq.c is used for Backup jobs
(more important).

> For PostgreSQL, there is one other thing I can think of that might be
> done to speed things up... single SQL statements are executed with an
> implicit BEGIN TRANSACTION/END TRANSACTION. If it's doable without
> affecting the other databases, would you be receptive to a patch to wrap
> the inserts in a transaction (probably a new transaction every 1000 or
> 5000 records or so)?
> I just ran a very quick test:
>
> CREATE TABLE testing (
>     num1 int8 PRIMARY KEY NOT NULL,
>     num2 int8);
>
> Script 1: test bare insert and update:
> loop from 0-1000, do an insert of num1, then update num2 with a random #
>
> testins 0.33s user 0.20s system 1% cpu 47.331 total
>
> Script 2: wrap the loop from the above script in a begin/end transaction
> block:
>
> testtrans 0.07s user 0.03s system 5% cpu 1.845 total

As I mentioned in a previous email, this strategy will work only if one
job is active at a time -- for the reason, please see Martin Simmons' (if
I remember right) email on this subject. It would be relatively easy to
find out how many jobs are using the database at the same time, since
there is a reference count in the database packet. The problem gets
slightly more complicated if one job starts a transaction and another
wants to start using the database -- it might not be too hard to turn
transactions off and block the second job until the first job notices and
terminates the transaction, but all that is a bit messy ...

--
Best regards,

Kern

_______________________________________________
Bacula-users mailing list
Bacula-users@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/bacula-users