Hi Craig,
your option (1) sounds good to me. However, there is no requirement that
all file entries in a directory are consecutive, so the raw entry list
could potentially be
dirA/file1
dirB/file1
dirA/file2
dirB/file2
depending on how files are inserted in the ZIP. So you likely need to
sort things before creating your index.
Even
Le 19/08/2025 à 07:26, Craig de Stigter via gdal-dev a écrit :
Hi folks
I've stumbled across VSIReadDirRecursive being really slow when I give
it a ridiculously large ZIP file (containing 5 million files across
~1500 subdirectories)
I spent a while poking round the source code. It looks like
VSIArchiveFilesystemHandler::ReadDirEx() performs repeated linear
scans through the flat VSIArchiveContent::entries array during
recursive directory traversal. For each directory level, it scans all
entries from the beginning, resulting in O(n²) time complexity.
Performance degrades from ~1.3s for the first 5,000 files to ~6.7s for
5000 files once I get 100K files into a 5-million-file ZIP archive,
and keeps getting worse from there. I haven't managed to list the
whole 5M-file archive yet...
A couple of possible solutions:
1. Add a directory index to VSIArchiveContent (add a map of string
directory paths to index in the entries array) so we can jumpstart the
ReadDirEx implementation at the right place
2. make a VSIDIRArchive class (subclass of VSIDIR) and override
OpenDir/NextDirEntry, so that it doens't call ReadDirEx repeatedly but
instead just returns entries from the VSIArchiveContent::entries array.
I'm leaning towards (1) because it would presumably improve random
lookups by file path also (not just ReadDirRecursive). Is this
something that would be accepted as a PR?
Thanks
--
Regards,
Craig
Platform Engineer
Koordinates
koordinates.com <http://koordinates.com/> / @koordinates
<https://twitter.com/koordinates>
_______________________________________________
gdal-dev mailing list
gdal-dev@lists.osgeo.org
https://lists.osgeo.org/mailman/listinfo/gdal-dev
--
http://www.spatialys.com
My software is free, but my time generally not.
_______________________________________________
gdal-dev mailing list
gdal-dev@lists.osgeo.org
https://lists.osgeo.org/mailman/listinfo/gdal-dev