clintropolis opened a new pull request, #19282: URL: https://github.com/apache/druid/pull/19282
### Description

This PR adds the building blocks for supporting partial segment download when using the vsf segment cache. It introduces a new `SegmentRangeReader` interface, which will allow deep storage extensions to provide byte-range reads from segment files in deep storage.

To consume this interface, a new `PartialSegmentFileMapperV10` class has been added. On creation, it fetches the 'header' portion of a v10 segment (one that is not externally compressed, e.g. .zip) and stores it to disk, so that it has the metadata and positions of all of the internal files of the segment which make up the columns. We also append a bitmap to this file (one bit per internal file of the segment), which is mmapped read-write and updated with a single-byte read-modify-write under a lock whenever an internal file is fetched.

Fetched internal files are stored in separate local 'container' files, which correspond to the containers of the v10 format, so that we can just re-use the positions of all of the internal files within the containers. The container files themselves are created as 'sparse' files at the original container size; downloaded file bytes are written at their original offsets via `RandomAccessFile`, and the read-only mmap sees the writes through the shared page cache.

Follow-ups to this PR will begin the work of wiring this up to actually be used in the segment cache, and ultimately to allow query engines to specify which segment parts they need, so that only the minimum amount of data required for query processing is fetched.
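The sparse container file and fetched-bitmap bookkeeping described above can be illustrated with a small standalone sketch. This is not the code from the PR: the class name `SparseContainerSketch`, its methods, and the file names are all hypothetical, and the bitmap lives in its own file here for simplicity rather than being appended to the header file. Whether a read-only mapping observes writes made through `RandomAccessFile` is operating-system dependent according to the Java documentation, so treat that part as an assumption that holds on typical Linux setups.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;

// Hypothetical illustration only; names and layout are assumptions, not the PR's classes.
final class SparseContainerSketch
{
  private final Object lock = new Object();
  private final Path containerPath;
  private final MappedByteBuffer container; // read-only view of the container file
  private final MappedByteBuffer bitmap;    // read-write, one bit per internal file

  SparseContainerSketch(Path dir, long containerSize, int numInternalFiles) throws IOException
  {
    containerPath = dir.resolve("container-0");
    try (RandomAccessFile raf = new RandomAccessFile(containerPath.toFile(), "rw")) {
      // Setting the length without writing yields a sparse file on filesystems that support holes.
      raf.setLength(containerSize);
      container = raf.getChannel().map(FileChannel.MapMode.READ_ONLY, 0, containerSize);
    }
    int bitmapBytes = (numInternalFiles + 7) / 8;
    try (RandomAccessFile raf = new RandomAccessFile(dir.resolve("bitmap").toFile(), "rw")) {
      raf.setLength(bitmapBytes);
      bitmap = raf.getChannel().map(FileChannel.MapMode.READ_WRITE, 0, bitmapBytes);
    }
  }

  /** Writes downloaded bytes at their original container offset, then marks the file as fetched. */
  void writeFetched(int fileIndex, long offset, byte[] bytes) throws IOException
  {
    try (RandomAccessFile raf = new RandomAccessFile(containerPath.toFile(), "rw")) {
      raf.seek(offset);
      raf.write(bytes);
    }
    synchronized (lock) {
      // Single-byte read-modify-write of the bitmap under a lock.
      int byteIndex = fileIndex >>> 3;
      byte current = bitmap.get(byteIndex);
      bitmap.put(byteIndex, (byte) (current | (1 << (fileIndex & 7))));
    }
  }

  boolean isFetched(int fileIndex)
  {
    synchronized (lock) {
      return (bitmap.get(fileIndex >>> 3) & (1 << (fileIndex & 7))) != 0;
    }
  }

  /** Reads through the read-only mapping using absolute gets, leaving buffer position untouched. */
  byte[] read(long offset, int length)
  {
    byte[] out = new byte[length];
    for (int i = 0; i < length; i++) {
      out[i] = container.get((int) offset + i);
    }
    return out;
  }
}
```

A caller would check `isFetched(i)` before reading an internal file and call `writeFetched(...)` after downloading it; because the bitmap is persisted, a restart could in principle skip files that were already downloaded.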
Initially at least, I am thinking of using projections as the level of 'granularity' for how segment chunks are accounted for in the segment cache (so the 'size' in the cache will be the size of the whole projection, which is lazily filled in as it is downloaded). I will therefore also be doing a follow-up to better organize the projections into containers in `SegmentFileBuilderV10`, instead of just filling whole containers at a time, so that we have an easy way to map eviction to deleting these container files.

changes:
* adds new `SegmentRangeReader` extension point interface for byte-range reads from segment files in deep storage
* adds `PartialSegmentFileMapperV10`, a `SegmentFileMapper` implementation that downloads internal files on demand from deep storage via `SegmentRangeReader`; not wired to anything yet other than tests
* extracts `SegmentFileMetadataReader`, a shared utility for parsing the V10 header + metadata from any `InputStream`, out of `SegmentFileMapperV10.create()` so it can be shared with `PartialSegmentFileMapperV10`
* adds `openRangeReader()` method to `LoadSpec` with a default implementation that returns null
* `SegmentFileMetadata` now interns string keys in files and column descriptor maps using `SmooshedFileMapper.STRING_INTERNER`
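To make the extension point and the null default from the change list more concrete, here is a minimal sketch of the general pattern. The real `SegmentRangeReader` and `LoadSpec.openRangeReader()` signatures are not shown in this message, so the `RangeReader` and `LoadSpecLike` interfaces below are hypothetical stand-ins, and the local-file reader only simulates deep storage. It assumes Java 9+ for `InputStream.readAllBytes()`.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.RandomAccessFile;
import java.nio.file.Path;

// Hypothetical illustration only; these are NOT the interfaces added by the PR.
final class RangeReadSketch
{
  /** Stand-in for a byte-range reader over a segment file in deep storage. */
  interface RangeReader
  {
    InputStream readRange(long offset, int length) throws IOException;
  }

  /** Stand-in for a load spec whose default is "range reads not supported". */
  interface LoadSpecLike
  {
    default RangeReader openRangeReader()
    {
      return null;
    }
  }

  /** Simulates deep storage with a local file. */
  static RangeReader forLocalFile(Path file)
  {
    return (offset, length) -> {
      try (RandomAccessFile raf = new RandomAccessFile(file.toFile(), "r")) {
        byte[] buf = new byte[length];
        raf.seek(offset);
        raf.readFully(buf);
        return new ByteArrayInputStream(buf);
      }
    };
  }

  /** Returns the requested bytes, or null when the load spec offers no range reader. */
  static byte[] fetch(LoadSpecLike spec, long offset, int length) throws IOException
  {
    RangeReader reader = spec.openRangeReader();
    if (reader == null) {
      return null; // a caller would fall back to downloading the whole segment
    }
    try (InputStream in = reader.readRange(offset, length)) {
      return in.readAllBytes();
    }
  }
}
```

Returning null from the default method lets existing load specs keep working unchanged, at the cost of a null check at every call site.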
