So maybe its name is a little confusing? But I guess it’s the same across all pages?
On Fri, Jun 27, 2025 at 1:52 AM Ritik Raj <ri...@apache.org> wrote:

> yes, I could have been more clear.
> Thanks for asking.
>
> On 2025/06/27 06:42:24 Taewoo Kim wrote:
> > Thanks for the clarification. So, essentially it means the column count in
> > the current segment. I somehow thought "max column index" as "the last
> > column offset in the current segment".
> >
> > On Thu, Jun 26, 2025 at 11:37 PM Ritik Raj <ri...@apache.org> wrote:
> >
> > > The columnIndex is essentially an ID assigned to each column within an
> > > LSMIndex, typically ranging from [0, N-1] for N columns.
> > > Metadata for each column (such as offsets, min/max values, etc.) is stored
> > > sequentially based on this index.
> > > For example, if columnIndex = 0 has its offset at position X, then
> > > columnIndex = 1's offset would be at X + Integer.BYTES.
> > >
> > > There are two primary reasons:
> > > ## Not Penalizing Well-Modeled Schemas
> > >
> > > Many customers have a good data model with far fewer than 5000 columns.
> > > Setting a Max Column Index (e.g., 5000) ensures these users are not
> > > penalized with additional overhead from extra metadata segments.
> > > It allows efficient operation for typical use cases without incurring
> > > unnecessary costs.
> > >
> > > ## Simplified Segment Calculation for Column Metadata
> > >
> > > This boundary helps determine the segment in which a column's metadata
> > > resides.
> > > For example:
> > >
> > > Max columns in the first segment = 5000
> > > Buffer cache page size = 32KB
> > >
> > > If each column's metadata takes X bytes, then the number of
> > > columns per page/segment is:
> > > R = 32KB / X
> > > To find which segment column I resides in:
> > >
> > > If I < 5000: it's in segment 0
> > > If I ≥ 5000: it's in
> > > segment = ((I - 5000) / R) + 1
> > >
> > > On 2025/06/26 23:22:01 Taewoo Kim wrote:
> > > > +1
> > > >
> > > > Q: What's the main point of having the "Max Column Index"?
> > > >
> > > > Best,
> > > > Taewoo
> > > >
> > > > On Thu, Jun 26, 2025 at 3:33 PM Ian Maxon <ima...@uci.edu> wrote:
> > > >
> > > > > +1
> > > > >
> > > > > On Thu, Jun 26, 2025 at 3:08 PM Mike Carey <dtab...@gmail.com> wrote:
> > > > > >
> > > > > > This looks excellent!
> > > > > >
> > > > > > +1 for adopting this extension to our storage system ASAP.
> > > > > >
> > > > > > On 6/26/25 11:53 AM, Ritik Raj wrote:
> > > > > > > *Expanding PageZero to Support Unlimited Columns*
> > > > > > > APE:
> > > > > > > https://urldefense.com/v3/__https://cwiki.apache.org/confluence/display/ASTERIXDB/APE*23*3A*Unlimited*Columns*Support__;KyUrKys!!CzAuKJ42GuquVTTmVmPViYEvSg!P1EHZSwq7hcpOlyHuy7R1F0lAkJK31elLGusrjb58xBVxuuNH4gxpVwKRuJSv9mByOtN5siVn5A6sQ$
> > > > > > >
> > > > > > > In the columnar storage format, each MegaPage represents a logical leaf
> > > > > > > node and begins with `PageZero`, a metadata section that captures essential
> > > > > > > column metadata including column offsets and min/max filters. Originally,
> > > > > > > `PageZero` was constrained to reside in a single page (typically 128KB),
> > > > > > > with a fixed layout that stored information for **every column** in the
> > > > > > > global schema.
> > > > > > >
> > > > > > > Each column entry consumed 4 bytes for offset and 16 bytes for a min/max
> > > > > > > filter, leading to a **metadata footprint of 20 bytes per column**. With
> > > > > > > this layout, the **maximum number of columns supported was capped at
> > > > > > > ~6,000**, given space constraints and the need to reserve part of
> > > > > > > `PageZero` for primary key metadata and structural headers.
> > > > > > >
> > > > > > > This limitation became problematic for datasets with **wide or sparse
> > > > > > > schemas**, where many columns may be missing in individual document batches
> > > > > > > but still occupy space in `PageZero`. The presence of unused metadata
> > > > > > > bloated the footprint and limited scalability.
> > > > > > >
> > > > > > > *Multi-Segment PageZero: Motivation and Layout*
> > > > > > >
> > > > > > > To overcome this limitation, we introduce **multi-segment support in
> > > > > > > PageZero**. Instead of storing all metadata in a single fixed block, we
> > > > > > > partition PageZero into multiple **segments**, with the **first (zeroth)
> > > > > > > segment storing primary key metadata and as many column entries as it can
> > > > > > > fit**, and subsequent segments storing the remaining metadata.
> > > > > > >
> > > > > > > Each segment follows the same layout: column index → offset → min → max,
> > > > > > > stored in an interleaved manner. This structure ensures efficient scan and
> > > > > > > lookup, while enabling us to scale to **arbitrarily many columns**, bounded
> > > > > > > only by MegaPage size.
> > > > > > >
> > > > > > > *Segment Layout:*
> > > > > > >
> > > > > > > ```
> > > > > > > [ Segment Header ]
> > > > > > >   ├─ Number of Columns
> > > > > > >   ├─ Max Column Index in Segment
> > > > > > > [ Interleaved Metadata Entries ]
> > > > > > >   ├─ ColumnIndex₁, Offset₁, Min₁, Max₁
> > > > > > >   ├─ ColumnIndex₂, Offset₂, Min₂, Max₂
> > > > > > >   └─ ...
> > > > > > > ```
> > > > > > >
> > > > > > > A new `DefaultColumnMultiPageZeroWriter` class was introduced to manage
> > > > > > > this segmented layout. It delegates metadata writing to individual segments
> > > > > > > while maintaining headers at the top-level for navigation.
> > > > > > >
> > > > > > > *Adaptive Writer Selection*
> > > > > > >
> > > > > > > To avoid burdening all batches with this segmented structure, we retain the
> > > > > > > `DefaultColumnPageZeroWriter` for small or dense schemas. A new **adaptive
> > > > > > > selection mechanism** compares space usage of both writers for a batch and
> > > > > > > picks the optimal one.
> > > > > > >
> > > > > > > The decision logic weighs:
> > > > > > > - Space taken by Default Multi-segment writer (fixed layout for all columns)
> > > > > > > - Space taken by Sparse Multi-Segment writer (compact layout for present
> > > > > > >   columns)
> > > > > > >
> > > > > > > This logic is encapsulated in `PageZeroWriterFlavorSelector`.
> > > > > > >
> > > > > > > *New Configuration Options:*
> > > > > > >
> > > > > > > Two new storage configuration parameters have been introduced:
> > > > > > >
> > > > > > > 1. **`STORAGE_MAX_COLUMNS_IN_ZEROTH_SEGMENT`** (`INTEGER_BYTE_UNIT`,
> > > > > > >    default: `5000`)
> > > > > > >    Controls the maximum number of columns that can be stored in the zeroth
> > > > > > >    segment of `PageZero`. Remaining columns, if any, are offloaded to
> > > > > > >    additional segments. This helps balance lookup performance (fast for zeroth
> > > > > > >    segment) and scalability. This might change based on perf experiments.
> > > > > > >
> > > > > > > 2. **`STORAGE_PAGE_ZERO_WRITER`** (`STRING`, default: `"default"`)
> > > > > > >    Controls the writer strategy used during flush. Accepted values are:
> > > > > > >    - `"default"`: Always use the legacy writer.
> > > > > > >    - `"sparse"`: Always use the sparse writer (only present columns).
> > > > > > >    - `"adaptive"`: Dynamically compare both and pick the writer that uses
> > > > > > >      less space.
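
For anyone following along, the adaptive selection described above could be sketched roughly as below. This is only an illustration under stated assumptions: the class and method names are made up and are not the actual `PageZeroWriterFlavorSelector` API, segment headers and primary key metadata are ignored, and the 4-byte column index per sparse entry is an assumption (the proposal only gives 4 bytes for the offset and 16 bytes for the min/max filter).

```java
/**
 * Hypothetical sketch of the "adaptive" writer choice.
 * Names and size estimates are illustrative assumptions, not AsterixDB code.
 */
final class WriterFlavorSketch {
    // From the proposal: 4-byte offset plus 16-byte min/max filter per column.
    static final int DEFAULT_ENTRY_BYTES = Integer.BYTES + 16;
    // Assumption: the sparse layout also stores a 4-byte column index per entry.
    static final int SPARSE_ENTRY_BYTES = Integer.BYTES + DEFAULT_ENTRY_BYTES;

    enum Flavor { DEFAULT, SPARSE }

    /** Fixed layout: one entry for every column in the global schema. */
    static long defaultBytes(int columnsInSchema) {
        return (long) columnsInSchema * DEFAULT_ENTRY_BYTES;
    }

    /** Compact layout: one entry per column actually present in the batch. */
    static long sparseBytes(int presentColumns) {
        return (long) presentColumns * SPARSE_ENTRY_BYTES;
    }

    /** Picks the smaller estimate; a tie keeps the default layout. */
    static Flavor choose(int columnsInSchema, int presentColumns) {
        return sparseBytes(presentColumns) < defaultBytes(columnsInSchema)
                ? Flavor.SPARSE
                : Flavor.DEFAULT;
    }
}
```

Under these assumptions the sparse layout only wins when fewer than about 5/6 of the schema's columns are present in the batch, since each sparse entry costs 24 bytes against 20.
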
> > > > > > >
> > > > > > > *Summary of Changes*
> > > > > > >
> > > > > > > - Interleaved layout per segment for columnIndex, offset, min, max.
> > > > > > > - Logic to estimate the number of segments and assign columns to segments.
> > > > > > > - Writer is selected dynamically using `PageZeroWriterFlavorSelector`.
> > > > > > >
> > > > > > > *Benefits*
> > > > > > >
> > > > > > > - Unlocks support for **tens of thousands of columns** per MegaPage.
> > > > > > > - Better space efficiency for sparse batches.
> > > > > > > - Retains backward compatibility: Already ingested MegaLeafs can also be
> > > > > > >   read.
> > > > > > >
> > > > > > > This change is essential for evolving workloads that increasingly rely on
> > > > > > > flexible schemas and sparse data layouts.
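
To make the segment calculation from Ritik's reply concrete, here is a minimal sketch of it. The class and method names are hypothetical and are not taken from the AsterixDB code base. The formula itself is the one quoted in the thread, segment = ((I - 5000) / R) + 1 with R = page size / X.

```java
/**
 * Hypothetical sketch of the segment lookup described in the thread.
 * Names are illustrative assumptions, not AsterixDB code.
 */
final class SegmentLocatorSketch {
    /** R in the thread: how many column entries fit in one page/segment. */
    static int columnsPerSegment(int pageSizeBytes, int bytesPerColumn) {
        if (pageSizeBytes <= 0 || bytesPerColumn <= 0) {
            throw new IllegalArgumentException("sizes must be positive");
        }
        return pageSizeBytes / bytesPerColumn;
    }

    /**
     * Segment holding the metadata of columnIndex: segment 0 for indexes
     * below the zeroth-segment limit, otherwise ((I - limit) / R) + 1.
     */
    static int segmentFor(int columnIndex, int maxColumnsInZerothSegment, int columnsPerSegment) {
        if (columnIndex < 0) {
            throw new IllegalArgumentException("columnIndex must be non-negative");
        }
        if (columnIndex < maxColumnsInZerothSegment) {
            return 0;
        }
        // Integer division floors, which matches the formula in the thread.
        return (columnIndex - maxColumnsInZerothSegment) / columnsPerSegment + 1;
    }
}
```

As a worked case, assume a 32KB page (32768 bytes) and X = 20 bytes per column, the per-column footprint given in the original proposal. Then R = 1638, so with a limit of 5000 the columns 5000 through 6637 fall in segment 1 and column 6638 is the first one in segment 2.
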