clintropolis opened a new pull request, #19282:
URL: https://github.com/apache/druid/pull/19282

   ### Description
   This PR adds the building blocks for supporting partial segment download 
when using the vsf segment cache. It introduces a new `SegmentRangeReader` 
interface that will allow deep storage extensions to provide byte-range reads 
from segment files in deep storage.
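
   As a rough illustration of what a byte-range read extension point could 
look like, here is a minimal sketch. The names `RangeReader`, `readRange`, and 
`LocalFileRangeReader` are assumptions made up for this example (a local file 
stands in for deep storage); the actual `SegmentRangeReader` signature may 
differ.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.RandomAccessFile;
import java.nio.file.Path;

// Hypothetical shape of a byte-range read extension point; the real
// SegmentRangeReader interface may look different.
interface RangeReader
{
  // Returns a stream over bytes [offset, offset + length) of the segment file.
  InputStream readRange(long offset, long length) throws IOException;
}

// Illustrative implementation backed by a local file standing in for deep storage.
class LocalFileRangeReader implements RangeReader
{
  private final Path file;

  LocalFileRangeReader(Path file)
  {
    this.file = file;
  }

  @Override
  public InputStream readRange(long offset, long length) throws IOException
  {
    try (RandomAccessFile raf = new RandomAccessFile(file.toFile(), "r")) {
      if (offset < 0 || length < 0 || offset + length > raf.length()) {
        throw new IOException("requested range is outside of the file");
      }
      // Read only the requested range into memory and hand it back as a stream.
      byte[] buf = new byte[Math.toIntExact(length)];
      raf.seek(offset);
      raf.readFully(buf);
      return new ByteArrayInputStream(buf);
    }
  }
}
```

An object store implementation would presumably translate the same offset and 
length into a ranged GET rather than a local seek.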
   
   To consume this interface, a new `PartialSegmentFileMapperV10` class has 
been added. On creation, it fetches the 'header' portion of a v10 segment (one 
that is not externally compressed, e.g. as a .zip) and stores it on disk, so 
that it has the metadata and positions of all of the internal files of the 
segment which make up the columns. We also append a bitmap to this file (one 
bit per internal file of the segment), which is mmapped read-write and updated 
with a single-byte read-modify-write under a lock whenever an internal file is 
fetched.
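
   The bitmap update described above can be sketched roughly as follows. The 
class name `FetchedBitmap`, its constructor parameters, and the bit order 
within each byte are assumptions for illustration only, not the layout used by 
`PartialSegmentFileMapperV10`.

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical helper: tracks which internal files have been fetched, one bit each.
class FetchedBitmap
{
  private final MappedByteBuffer bitmap;
  private final Object lock = new Object();

  // Maps the bitmap region of 'file' read-write, starting at 'bitmapOffset'.
  FetchedBitmap(Path file, long bitmapOffset, int numFiles) throws IOException
  {
    int numBytes = (numFiles + 7) / 8;
    try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ, StandardOpenOption.WRITE)) {
      // The mapping remains valid after the channel is closed.
      this.bitmap = ch.map(FileChannel.MapMode.READ_WRITE, bitmapOffset, numBytes);
    }
  }

  void markFetched(int fileIndex)
  {
    int byteIndex = fileIndex >>> 3;
    int mask = 1 << (fileIndex & 7); // assumed bit order: lowest bit first
    synchronized (lock) {
      // Single-byte read-modify-write; the lock keeps concurrent updates to
      // neighboring bits in the same byte from clobbering each other.
      bitmap.put(byteIndex, (byte) (bitmap.get(byteIndex) | mask));
    }
  }

  boolean isFetched(int fileIndex)
  {
    synchronized (lock) {
      return (bitmap.get(fileIndex >>> 3) & (1 << (fileIndex & 7))) != 0;
    }
  }
}
```

Because the bits live in the file itself, a mapper re-created over the same 
file can tell which internal files were already downloaded.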
   
   Fetched internal files are stored in separate local 'container' files, which 
correspond to the containers of the v10 format so that we can just re-use the 
positions of all of the internal files within the containers. The container 
files themselves are created as 'sparse' files at the original container size; 
downloaded file bytes are written at their original offsets via 
`RandomAccessFile`, and the read-only mmap sees writes through the shared page 
cache.
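
   A minimal sketch of the container file mechanics, assuming a hypothetical 
`SparseContainer` class (not a class in this PR): the file is sized up front, 
bytes are written at their original offsets with `RandomAccessFile`, and reads 
go through a read-only mapping. Note that the Java documentation leaves 
propagation of file writes to an existing mapping unspecified, and whether the 
file is actually sparse depends on the file system; the sketch relies on the 
shared page cache behavior found on typical Linux setups.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;

// Hypothetical stand-in for one local 'container' file.
class SparseContainer
{
  private final Path file;
  private final MappedByteBuffer view;

  SparseContainer(Path file, int containerSize) throws IOException
  {
    this.file = file;
    try (RandomAccessFile raf = new RandomAccessFile(file.toFile(), "rw")) {
      // Extend the file to the full container size without writing any data.
      raf.setLength(containerSize);
      // Read-only mapping over the whole container; stays valid after close.
      this.view = raf.getChannel().map(FileChannel.MapMode.READ_ONLY, 0, containerSize);
    }
  }

  // Writes downloaded bytes at the offset they occupy in the original container.
  void writeAt(long offset, byte[] bytes) throws IOException
  {
    try (RandomAccessFile raf = new RandomAccessFile(file.toFile(), "rw")) {
      raf.seek(offset);
      raf.write(bytes);
    }
  }

  // Reads through the mapping using the original position.
  byte readAt(int position)
  {
    return view.get(position);
  }
}
```

Keeping the original offsets means the positions recorded in the segment 
metadata can be used as-is against the local file.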
   
   Follow-ups to this PR will begin the work of wiring this up so that it is 
actually used in the segment cache, and will ultimately allow query engines to 
specify which segment parts they need, so that only the minimum amount of data 
required for query processing is fetched.
   
   Initially at least, I am thinking of making projections the level of 
'granularity' at which segment chunks are accounted for in the segment cache 
(so the 'size' in the cache will be the size of the whole projection, and it 
will just be lazily filled in as it is downloaded). I will also be doing a 
follow-up to better organize the projections into containers in 
`SegmentFileBuilderV10`, instead of just filling whole containers at a time, so 
that we have an easy way to map eviction to deleting these container files.
   
   changes:
   * adds new `SegmentRangeReader` extension point interface for byte-range 
reads from segment files in deep storage
   * adds `PartialSegmentFileMapperV10`, a `SegmentFileMapper` implementation 
that downloads internal files on demand from deep storage via 
`SegmentRangeReader`; not wired to anything yet other than tests
   * extracts `SegmentFileMetadataReader`, a utility for parsing the V10 
header + metadata from any `InputStream`, out of 
`SegmentFileMapperV10.create()` so it can be shared with 
`PartialSegmentFileMapperV10`
   * adds `openRangeReader()` method to `LoadSpec` with a default 
implementation that returns null
   * `SegmentFileMetadata` now interns string keys in files and column 
descriptor maps using `SmooshedFileMapper.STRING_INTERNER`
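
   To illustrate the `openRangeReader()` default mentioned in the list above, 
here is a small sketch of how a caller might treat a null return as "range 
reads not supported". All types here (`SketchLoadSpec`, `ByteRangeSource`, 
`LoadPlanner`, and the two spec classes) are hypothetical stand-ins, not the 
actual Druid interfaces.

```java
import java.util.Arrays;

// Hypothetical stand-in for a range reader; not the real SegmentRangeReader.
interface ByteRangeSource
{
  byte[] read(long offset, int length);
}

// Hypothetical stand-in for LoadSpec, showing only the new default method.
interface SketchLoadSpec
{
  // Default: this deep storage does not support byte-range reads.
  default ByteRangeSource openRangeReader()
  {
    return null;
  }
}

// A spec that keeps the default and so can only be loaded in full.
class FullDownloadOnlySpec implements SketchLoadSpec
{
}

// A spec that opts in by overriding the default.
class RangeCapableSpec implements SketchLoadSpec
{
  private final byte[] data;

  RangeCapableSpec(byte[] data)
  {
    this.data = data;
  }

  @Override
  public ByteRangeSource openRangeReader()
  {
    return (offset, length) -> Arrays.copyOfRange(data, (int) offset, (int) offset + length);
  }
}

class LoadPlanner
{
  // Falls back to a full download when no range reader is available.
  static String choose(SketchLoadSpec spec)
  {
    return spec.openRangeReader() == null ? "full-download" : "partial-download";
  }
}
```

With a null-returning default, existing deep storage extensions keep working 
unchanged and only those that override the method become eligible for partial 
loading.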
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

