Hello Eric,

Background (somewhat simplified):

The Accurate delete/restore project (item #1) is now implemented (thanks to Eric). For it to work, the client machine must maintain a temporary table supplied by the Director for each job. This table contains all the files that are in the current backup from the Director's point of view at the beginning of the job. The table allows the FD to determine which files have been added to or deleted from the system but are not yet saved (or deleted).
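In case it helps to picture what the FD does with that table, here is a minimal sketch of the detection logic. It uses a plain std::unordered_map and made-up names (AccurateTable, mark_seen, deleted_files) rather than the real htable class, so treat it as an illustration of the idea, not the actual FD code:

#include <string>
#include <unordered_map>
#include <vector>

/* Hypothetical sketch: 'director_files' stands in for the table sent by
 * the Director; the bool records whether the FD has seen the file while
 * scanning the client filesystem. */
struct AccurateTable {
   std::unordered_map<std::string, bool> director_files;   /* fname -> seen */

   /* Called for every file found during the scan.  Returns true if the
    * file is new, i.e. not in the Director's list. */
   bool mark_seen(const std::string &fname) {
      auto it = director_files.find(fname);
      if (it == director_files.end()) {
         return true;               /* new file: not yet saved */
      }
      it->second = true;            /* known file: still present */
      return false;
   }

   /* After the scan, anything never marked was deleted on the client. */
   std::vector<std::string> deleted_files() const {
      std::vector<std::string> deleted;
      for (const auto &e : director_files) {
         if (!e.second) {
            deleted.push_back(e.first);
         }
      }
      return deleted;
   }
};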
As currently implemented, this table is a hash table using the hash class that I wrote 3 or 4 years ago for this particular project. It is fast and efficient.

Problem:

Currently the hash table is kept entirely in memory, which means the client uses a rather huge amount of memory; for systems with 10 million files this will probably exceed even the largest amounts of RAM currently in use. We would like a solution that allows the number of files to grow to 20 million or more while still using less than 1GB of RAM.

Recently (in the last few days), I implemented what I call BIG_MALLOC for the htable class, which allocates memory in 1MB chunks. This makes the code run several times faster with datasets on the order of 5 million files, by reducing the number of system malloc requests from 5-6 million to fewer than 500. However, it still does not reduce the memory footprint. (A simplified sketch of this chunking idea appears after the list below.)

Suggestions for reducing the memory requirements:

1. General: Currently each file is kept as a full path and filename. This makes things simple, but it means there may be thousands of copies of a long path name (one for each file in the directory). Solution: separate the filename from the path name, as is done in the Bacula DB, and keep each path only once in a red/black tree (see the sketch after this list). Benefit: a lot less memory used (perhaps cutting it to less than half, or even a quarter). It is quite easy to implement. It might give us a lot of breathing room and still allow us to implement the ideas below later. Some simple tests of the compression ratio on real filesystems should be done before implementing this, to ensure it really gives an important reduction in size. Downside: slightly more overhead. Not a full solution to the problem.

2. General: All the data is currently kept in big 1MB blocks of memory. Solution: we could quite easily keep a single block in memory and write subsequent blocks to a file. The OS would act as a paging/cache system, so the pages could be retrieved relatively quickly. Benefit: the number of files would then scale to whatever you could fit on disk. It would be rather easy to implement. Downside: this might be considerably slower, because the hash code has to follow the links and compare hash codes, and each link it follows for a file (average 5, max 30) could require an I/O.

3. Idea: A variation of #2 would be to move the links out of the data item and into memory, so that the paged data would require an I/O only when there are collisions in a bucket, but not for traversing the buckets (sketched after this list). Benefit: probably will run *much* faster. Downside: will use more memory. We could fit something like 20 million links in about 250MB of RAM, which is reasonable.

4. Idea: A variation of item #3 would be to generate two hash codes at the same time and compare both (also sketched after this list). Benefit: this would *significantly* reduce the need to do a full comparison of the key, and computing a second or even a third hash while we are computing the first one costs negligible CPU. Downside: the main cost is an increase in memory. For about 20 million files, we would need an additional 80MB of RAM, or about 330MB total. This still seems reasonable.

5. Idea: A variation on #4 would be to page the hash links as well and allow the user to specify the total memory to use. This would be a slightly larger project, but not an enormous one. Probably not worth doing, at least at the moment.

6. Idea: Use memory-mapped files. Benefit: this would simplify the code and the OS would do the paging. Downside: performance is not controlled or guaranteed, and there would probably be *big* porting problems among the various Unixes and Win32.
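For reference, here is roughly what the BIG_MALLOC chunking mentioned above amounts to: a bump allocator that hands out pieces of 1MB chunks so that only a few hundred malloc() calls are needed. The class and names are invented for illustration; this is not the real htable code:

#include <cstddef>
#include <cstdlib>
#include <vector>

/* Illustration of 1MB-chunk allocation: instead of one malloc() per hash
 * entry, grab 1MB at a time and hand out aligned pieces of it.  Requests
 * are assumed to be much smaller than one chunk; error checking omitted. */
class ChunkAllocator {
   static const size_t CHUNK_SIZE = 1024 * 1024;   /* 1MB per chunk */
   std::vector<char *> chunks;                     /* kept so we can free() */
   char *cur = nullptr;                            /* current chunk */
   size_t used = CHUNK_SIZE;                       /* bytes used in cur */

public:
   void *alloc(size_t size) {
      size = (size + 7) & ~size_t(7);              /* 8-byte alignment */
      if (used + size > CHUNK_SIZE) {              /* current chunk is full */
         cur = (char *)malloc(CHUNK_SIZE);
         chunks.push_back(cur);
         used = 0;
      }
      void *p = cur + used;
      used += size;
      return p;
   }

   ~ChunkAllocator() {
      for (char *c : chunks) {
         free(c);                                  /* one free per chunk */
      }
   }
};

The point is simply that the system allocator is called once per megabyte instead of once per entry.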
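To make suggestion #1 a bit more concrete, here is a minimal sketch of the path/filename split. std::map stands in for the red/black tree and the names (PathTable, add_file, etc.) are invented; a real implementation would presumably use our own classes and the chunk allocator rather than the STL:

#include <map>
#include <string>
#include <vector>

/* Sketch of suggestion #1: store each directory path only once and let
 * every file entry point at the shared path string.  std::map is a
 * stand-in for a red/black tree; names are made up for illustration. */
struct FileEntry {
   const std::string *path;            /* shared, one copy per directory */
   std::string name;                   /* filename only */
};

class PathTable {
   std::map<std::string, int> paths;   /* path -> number of files using it */
   std::vector<FileEntry> files;

public:
   void add_file(const std::string &full_path) {
      size_t slash = full_path.rfind('/');
      std::string dir  = full_path.substr(0, slash + 1);   /* keeps the '/' */
      std::string name = full_path.substr(slash + 1);
      auto it = paths.insert({dir, 0}).first;              /* stored once */
      it->second++;
      files.push_back(FileEntry{&it->first, name});
   }
};

As a rough illustration, a directory holding 100 files with a 60-byte path currently stores about 6KB of duplicated path text; shared storage reduces that to 60 bytes, which is where most of the savings would come from.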
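Suggestions #2 and #3 in sketch form: the links (hash code, file offset, next pointer) stay in RAM while the bulky part, the filename itself, is written to a temporary spill file and read back only when a real key comparison is needed. Everything here (SpilledKeys, the Link layout, the use of stdio) is invented just to show the shape of the idea, not a proposal for the actual layout:

#include <stdint.h>
#include <cstdio>
#include <string>
#include <vector>

/* Sketch of suggestions #2/#3: links in RAM, key data paged to a file.
 * Error handling and >2GB offsets are ignored to keep the sketch short;
 * the OS page cache does the actual caching of the spill file. */
struct Link {
   uint32_t hash;             /* hash code kept in memory */
   long     offset;           /* where the key lives in the spill file */
   uint32_t len;              /* length of the key on disk */
   int      next;             /* index of next link in the bucket, -1 = end */
};

class SpilledKeys {
   FILE *spill = tmpfile();           /* temporary file, deleted on close */
   std::vector<Link> links;           /* all links stay in RAM */

public:
   int add(uint32_t hash, const std::string &key, int next_in_bucket) {
      fseek(spill, 0, SEEK_END);
      long off = ftell(spill);
      fwrite(key.data(), 1, key.size(), spill);
      links.push_back(Link{hash, off, (uint32_t)key.size(), next_in_bucket});
      return (int)links.size() - 1;   /* index of the new link */
   }

   /* Full comparison: one read from the spill file, done only when the
    * in-memory hash code already matched. */
   bool key_equals(int link_index, const std::string &key) {
      const Link &l = links[link_index];
      if (l.len != key.size()) {
         return false;                /* cheap reject without any I/O */
      }
      std::string buf(l.len, '\0');
      fseek(spill, l.offset, SEEK_SET);
      if (fread(&buf[0], 1, l.len, spill) != l.len) {
         return false;
      }
      return buf == key;
   }
};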
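And a small sketch of the double-hash check from suggestion #4: two independent 32-bit hashes are computed in the same pass over the key, and the full key comparison is done only when both of them match. The particular hash functions (djb2/FNV-1a style) and names are just placeholders:

#include <stdint.h>
#include <string.h>

/* Sketch of suggestion #4: compute two hash codes in a single pass over
 * the key.  When probing a bucket we compare both 32-bit codes first and
 * only fall back to a full strcmp() when both agree.  Illustration only. */
struct Hashes {
   uint32_t h1;
   uint32_t h2;
};

static Hashes hash_key(const char *key) {
   uint32_t h1 = 5381;                /* djb2-style */
   uint32_t h2 = 2166136261u;         /* FNV-1a-style */
   for (const char *p = key; *p; p++) {
      h1 = h1 * 33 + (uint8_t)*p;
      h2 = (h2 ^ (uint8_t)*p) * 16777619u;
   }
   return Hashes{h1, h2};
}

struct HEntry {
   Hashes hashes;                     /* kept in the in-memory link */
   const char *key;                   /* the full key may be paged out */
};

/* Returns true only if the entry really matches 'key'. */
static bool match(const HEntry &e, const char *key, Hashes hk) {
   if (e.hashes.h1 != hk.h1 || e.hashes.h2 != hk.h2) {
      return false;                   /* cheap reject, no key access */
   }
   return strcmp(e.key, key) == 0;    /* rare: both hashes matched */
}

The extra 4 bytes per entry for the second hash is where the additional 80MB for 20 million files comes from.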
What do you think?  Any other ideas?

Best regards,

Kern