On Wed, Sep 17, 2014 at 05:02:33PM -0000, s...@apache.org wrote:
> Author: stsp
> Date: Wed Sep 17 17:02:33 2014
> New Revision: 1625674
>
> URL: http://svn.apache.org/r1625674
> Log:
> Fix a big scalability problem in the implementation of svnpredumpfilter.py.
>
> The script kept re-computing the set of additional include paths while
> mining the log history for copied paths. Each re-computation involved
> a full iteration of the set of copies accumulated so far, which made
> the run time explode on large repositories.
> Instead, we can gather all copies first, and then iterate them at once.
>
> In my testing this change reduces the runtime of svnpredumpfilter.py on
> a 64GB large dump file of the FreeBSD repository (up to r271458) from
> several days(!) to 1.5 minutes.
>
> * tools/server-side/svnpredumpfilter.py
>   (svn_log_stream_get_dependencies): Run dt.handle_changes() once the log
>    history has been fully scanned, not for each revision.
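
To illustrate the cost described in the log message above, here is a
minimal sketch. It is not the script's code: scan_per_revision,
scan_then_handle and count_visits are hypothetical names, and the
stand-in handle_changes only counts the entries that a full iteration
over the accumulated copies would touch.

```python
def scan_per_revision(revisions, handle_changes):
    """Old behaviour (sketch): process the accumulated copies per revision."""
    path_copies = {}
    for copies in revisions:
        path_copies.update(copies)
        handle_changes(path_copies)  # walks every copy seen so far


def scan_then_handle(revisions, handle_changes):
    """New behaviour (sketch): gather all copies, then process them once."""
    path_copies = {}
    for copies in revisions:
        path_copies.update(copies)
    handle_changes(path_copies)


def count_visits(scan, n):
    """Count entries visited when each of n revisions adds one copy."""
    visited = 0

    def handle_changes(path_copies):
        # Hypothetical stand-in: a full iteration touches every entry.
        nonlocal visited
        visited += len(path_copies)

    revisions = [{'dest%d' % i: 'src%d' % i} for i in range(n)]
    scan(revisions, handle_changes)
    return visited


print(count_visits(scan_per_revision, 1000))  # 500500
print(count_visits(scan_then_handle, 1000))   # 1000
```

With one copy per revision, the per-revision variant visits n(n+1)/2
entries while the deferred variant visits n, which is in line with the
quadratic blow-up the log message describes.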

It is possible that there is a slight regression with this change.
Currently the script is only detecting direct copy sources of the
to-be-included set of paths, but not copy sources of copy sources.
I'm working on a fix for this problem that doesn't involve reverting
this change and still lets the script complete its task within a
reasonable amount of time.

>
> Modified:
>     subversion/trunk/tools/server-side/svnpredumpfilter.py
>
> Modified: subversion/trunk/tools/server-side/svnpredumpfilter.py
> URL:
> http://svn.apache.org/viewvc/subversion/trunk/tools/server-side/svnpredumpfilter.py?rev=1625674&r1=1625673&r2=1625674&view=diff
> ==============================================================================
> --- subversion/trunk/tools/server-side/svnpredumpfilter.py (original)
> +++ subversion/trunk/tools/server-side/svnpredumpfilter.py Wed Sep 17 17:02:33 2014
> @@ -204,7 +204,6 @@ def svn_log_stream_get_dependencies(stre
>                sanitize_path(match.group(2))
>            else:
>              break
> -      dt.handle_changes(path_copies)
>
>        # Finally, skip any log message lines. (If there are none,
>        # remember the last line we read, because it probably has
> @@ -221,6 +220,7 @@ def svn_log_stream_get_dependencies(stre
>                        "'svn log' with the --verbose (-v) option when "
>                        "generating the input to this script?")
>
> +  dt.handle_changes(path_copies)
>    return dt
>
>  def analyze_logs(included_paths):
>
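
For reference, "copy sources of copy sources" means following the copy
map transitively. The sketch below shows one way to do that with a
worklist. It is an illustration under assumptions, not the planned fix:
transitive_copy_sources and related are hypothetical names, and
path_copies is assumed to map each copy destination to its copy source.
It is also coarse, in that it pulls in the whole copy source even when
only a child of the copy destination is included.

```python
def transitive_copy_sources(included_paths, path_copies):
    """Return every copy source reachable from the included paths.

    path_copies is assumed to map a copy destination path to the path
    it was copied from.
    """
    def related(a, b):
        # True if the paths are equal or one is an ancestor of the other.
        return a == b or a.startswith(b + '/') or b.startswith(a + '/')

    found = set()
    worklist = list(included_paths)
    while worklist:
        path = worklist.pop()
        for dest, source in path_copies.items():
            if related(path, dest) and source not in found:
                # Newly discovered source: it may itself be a copy
                # destination, so queue it for another pass.
                found.add(source)
                worklist.append(source)
    return found


copies = {
    'trunk/lib': 'branches/dev/lib',
    'branches/dev/lib': 'vendor/lib',
    'other': 'elsewhere',
}
print(sorted(transitive_copy_sources(['trunk/lib'], copies)))
# ['branches/dev/lib', 'vendor/lib']
```

Each discovered source is queued at most once, so cyclic copies cannot
loop forever, but every queued path still scans the whole copy map. On
a repository with many copies that scan would probably need an index
to stay fast.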