Hi folks I've stumbled across VSIReadDirRecursive being really slow when I give it a ridiculously large ZIP file (containing 5 million files across ~1500 subdirectories)
I spent a while poking round the source code. It looks like VSIArchiveFilesystemHandler::ReadDirEx() performs repeated linear scans through the flat VSIArchiveContent::entries array during recursive directory traversal. For each directory level, it scans all entries from the beginning, resulting in O(n²) time complexity. Performance degrades from ~1.3s for the first 5,000 files to ~6.7s for 5000 files once I get 100K files into a 5-million-file ZIP archive, and keeps getting worse from there. I haven't managed to list the whole 5M-file archive yet... A couple of possible solutions: 1. Add a directory index to VSIArchiveContent (add a map of string directory paths to index in the entries array) so we can jumpstart the ReadDirEx implementation at the right place 2. make a VSIDIRArchive class (subclass of VSIDIR) and override OpenDir/NextDirEntry, so that it doens't call ReadDirEx repeatedly but instead just returns entries from the VSIArchiveContent::entries array. I'm leaning towards (1) because it would presumably improve random lookups by file path also (not just ReadDirRecursive). Is this something that would be accepted as a PR? Thanks -- Regards, Craig Platform Engineer Koordinates koordinates.com / @koordinates <https://twitter.com/koordinates>
_______________________________________________ gdal-dev mailing list gdal-dev@lists.osgeo.org https://lists.osgeo.org/mailman/listinfo/gdal-dev