Re: [DISCUSS][APE] Supporting Unlimited Columns

Ritik Raj Fri, 27 Jun 2025 20:58:23 -0700

Headers reside in the zeroth segment of PageZero. The value 
MaxColumnIndexInZerothSegment indicates the maximum number of columns that can 
exist within this zeroth segment. In that sense, NumberOfColumnsInZerothSegment 
might have been a clearer name.


This value is stored in the header because it is part of the configuration, 
which can change. Since this configuration determines how columns are 
distributed across segments, we need to persist MaxColumnIndexInZerothSegment 
in order to accurately compute which segment a given column belongs to.
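
To illustrate why the persisted value matters, here is a minimal Python sketch
of the segment lookup described earlier in this thread. It is not the AsterixDB
implementation: the function name segment_of, the 20-byte entry size (4-byte
offset plus 16-byte min/max filter, taken from the original proposal), the 32KB
segment size, and the assumption that later segments carry no per-segment
header overhead are all illustrative.

```python
def segment_of(column_index: int,
               max_columns_in_zeroth_segment: int,
               columns_per_segment: int) -> int:
    """Return the PageZero segment assumed to hold a column's metadata.

    Hypothetical helper: mirrors the formula from this thread, not the
    actual AsterixDB code.
    """
    if column_index < 0:
        raise ValueError("column_index must be non-negative")
    if column_index < max_columns_in_zeroth_segment:
        return 0
    # Floor division: each later segment holds columns_per_segment entries.
    return (column_index - max_columns_in_zeroth_segment) // columns_per_segment + 1


# Assumed numbers: 32KB segments and 20 bytes of metadata per column.
SEGMENT_SIZE_BYTES = 32 * 1024
ENTRY_SIZE_BYTES = 20
COLUMNS_PER_SEGMENT = SEGMENT_SIZE_BYTES // ENTRY_SIZE_BYTES  # 1638

# Same column, two different persisted values of MaxColumnIndexInZerothSegment.
print(segment_of(6000, 5000, COLUMNS_PER_SEGMENT))  # 1
print(segment_of(6000, 7000, COLUMNS_PER_SEGMENT))  # 0
```

Under these assumptions, a page written with a limit of 5000 places column 6000
in segment 1, while a page written with a limit of 7000 keeps it in segment 0.
A reader therefore has to use the value stored in that page's header rather
than whatever the configuration currently says.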

On 2025/06/27 16:53:26 Mike Carey wrote:
> So maybe its name is a little confusing?  But I guess it’s the same across
> all pages?
> 
> On Fri, Jun 27, 2025 at 1:52 AM Ritik Raj <ri...@apache.org> wrote:
> 
> > Yes, I could have been clearer.
> > Thanks for asking.
> >
> > On 2025/06/27 06:42:24 Taewoo Kim wrote:
> > > Thanks for the clarification. So, essentially it means the column count
> > > in the current segment. I somehow thought of "max column index" as "the
> > > last column offset in the current segment".
> > >
> > > On Thu, Jun 26, 2025 at 11:37 PM Ritik Raj <ri...@apache.org> wrote:
> > >
> > > > The columnIndex is essentially an ID assigned to each column within an
> > > > LSMIndex, typically ranging from 0 to N-1 for N columns.
> > > > Metadata for each column (such as offsets, min/max values, etc.) is
> > > > stored sequentially based on this index.
> > > > For example, if columnIndex = 0 has its offset at position X, then
> > > > columnIndex = 1's offset would be at X + Integer.BYTES.
> > > >
> > > > There are two primary reasons for having the Max Column Index:
> > > > ## Not Penalizing Well-Modeled Schemas
> > > >
> > > > Many customers have a good data model with far fewer than 5000 columns.
> > > > Setting a Max Column Index (e.g., 5000) ensures these users are not
> > > > penalized with additional overhead from extra metadata segments.
> > > > It allows efficient operation for typical use cases without incurring
> > > > unnecessary costs.
> > > >
> > > > ## Simplified Segment Calculation for Column Metadata
> > > >
> > > > This boundary helps determine the segment in which a column's metadata
> > > > resides.
> > > >         For example:
> > > >
> > > >         Max columns in the first (zeroth) segment = 5000
> > > >         Buffer cache page size = 32KB
> > > >
> > > >         If each column's metadata takes X bytes, then the number of
> > > >         columns per page/segment is:
> > > >         R = floor(32KB / X)
> > > >         To find which segment column I resides in:
> > > >
> > > >         If I < 5000: it's in segment 0
> > > >         If I ≥ 5000: it's in
> > > >         segment = floor((I - 5000) / R) + 1
> > > >
> > > > On 2025/06/26 23:22:01 Taewoo Kim wrote:
> > > > > +1
> > > > >
> > > > > Q: What's the main point of having the "Max Column Index"?
> > > > >
> > > > > Best,
> > > > > Taewoo
> > > > >
> > > > >
> > > > > On Thu, Jun 26, 2025 at 3:33 PM Ian Maxon <ima...@uci.edu> wrote:
> > > > >
> > > > > > +1
> > > > > >
> > > > > > On Thu, Jun 26, 2025 at 3:08 PM Mike Carey <dtab...@gmail.com> wrote:
> > > > > > >
> > > > > > > This looks excellent!
> > > > > > >
> > > > > > > +1 for adopting this extension to our storage system ASAP.
> > > > > > >
> > > > > > > On 6/26/25 11:53 AM, Ritik Raj wrote:
> > > > > > > > *Expanding PageZero to Support Unlimited Columns*
> > > > > > > > APE:
> > > > > > > > https://urldefense.com/v3/__https://cwiki.apache.org/confluence/display/ASTERIXDB/APE*23*3A*Unlimited*Columns*Support__;KyUrKys!!CzAuKJ42GuquVTTmVmPViYEvSg!P1EHZSwq7hcpOlyHuy7R1F0lAkJK31elLGusrjb58xBVxuuNH4gxpVwKRuJSv9mByOtN5siVn5A6sQ$
> > > > > > > >
> > > > > > > > In the columnar storage format, each MegaPage represents a
> > > > > > > > logical leaf node and begins with `PageZero`, a metadata
> > > > > > > > section that captures essential column metadata including
> > > > > > > > column offsets and min/max filters. Originally, `PageZero`
> > > > > > > > was constrained to reside in a single page (typically 128KB),
> > > > > > > > with a fixed layout that stored information for **every
> > > > > > > > column** in the global schema.
> > > > > > > >
> > > > > > > > Each column entry consumed 4 bytes for offset and 16 bytes for
> > > > > > > > a min/max filter, leading to a **metadata footprint of 20
> > > > > > > > bytes per column**. With this layout, the **maximum number of
> > > > > > > > columns supported was capped at ~6,000**, given space
> > > > > > > > constraints and the need to reserve part of `PageZero` for
> > > > > > > > primary key metadata and structural headers.
> > > > > > > >
> > > > > > > > This limitation became problematic for datasets with **wide or
> > > > > > > > sparse schemas**, where many columns may be missing in
> > > > > > > > individual document batches but still occupy space in
> > > > > > > > `PageZero`. The presence of unused metadata bloated the
> > > > > > > > footprint and limited scalability.
> > > > > > > >
> > > > > > > > *Multi-Segment PageZero: Motivation and Layout*
> > > > > > > >
> > > > > > > > To overcome this limitation, we introduce **multi-segment
> > > > > > > > support in PageZero**. Instead of storing all metadata in a
> > > > > > > > single fixed block, we partition PageZero into multiple
> > > > > > > > **segments**, with the **first (zeroth) segment storing
> > > > > > > > primary key metadata and as many column entries as it can
> > > > > > > > fit**, and subsequent segments storing the remaining metadata.
> > > > > > > >
> > > > > > > > Each segment follows the same layout: column index → offset →
> > > > > > > > min → max, stored in an interleaved manner. This structure
> > > > > > > > ensures efficient scan and lookup, while enabling us to scale
> > > > > > > > to **arbitrarily many columns**, bounded only by MegaPage
> > > > > > > > size.
> > > > > > > >
> > > > > > > > *Segment Layout:*
> > > > > > > >
> > > > > > > > ```
> > > > > > > > [ Segment Header ]
> > > > > > > >   ├─ Number of Columns
> > > > > > > >   ├─ Max Column Index in Segment
> > > > > > > > [ Interleaved Metadata Entries ]
> > > > > > > >   ├─ ColumnIndex₁, Offset₁, Min₁, Max₁
> > > > > > > >   ├─ ColumnIndex₂, Offset₂, Min₂, Max₂
> > > > > > > >   └─ ...
> > > > > > > > ```
> > > > > > > >
> > > > > > > > A new `DefaultColumnMultiPageZeroWriter` class was introduced
> > > > > > > > to manage this segmented layout. It delegates metadata writing
> > > > > > > > to individual segments while maintaining headers at the top
> > > > > > > > level for navigation.
> > > > > > > >
> > > > > > > > *Adaptive Writer Selection*
> > > > > > > >
> > > > > > > > To avoid burdening all batches with this segmented structure,
> > > > > > > > we retain the `DefaultColumnPageZeroWriter` for small or dense
> > > > > > > > schemas. A new **adaptive selection mechanism** compares space
> > > > > > > > usage of both writers for a batch and picks the optimal one.
> > > > > > > >
> > > > > > > > The decision logic weighs:
> > > > > > > > - Space taken by the Default Multi-Segment writer (fixed
> > > > > > > >   layout for all columns)
> > > > > > > > - Space taken by the Sparse Multi-Segment writer (compact
> > > > > > > >   layout for present columns)
> > > > > > > >
> > > > > > > > This logic is encapsulated in `PageZeroWriterFlavorSelector`.
> > > > > > > >
> > > > > > > > *New Configuration Options:*
> > > > > > > >
> > > > > > > > Two new storage configuration parameters have been introduced:
> > > > > > > >
> > > > > > > > 1. **`STORAGE_MAX_COLUMNS_IN_ZEROTH_SEGMENT`**
> > > > > > > >    (`INTEGER_BYTE_UNIT`, default: `5000`)
> > > > > > > >     Controls the maximum number of columns that can be stored
> > > > > > > >     in the zeroth segment of `PageZero`. Remaining columns, if
> > > > > > > >     any, are offloaded to additional segments. This helps
> > > > > > > >     balance lookup performance (fast for the zeroth segment)
> > > > > > > >     and scalability. The default might change based on
> > > > > > > >     performance experiments.
> > > > > > > >
> > > > > > > > 2. **`STORAGE_PAGE_ZERO_WRITER`** (`STRING`, default: `"default"`)
> > > > > > > >     Controls the writer strategy used during flush. Accepted
> > > > > > > >     values are:
> > > > > > > >     - `"default"`: Always use the legacy writer.
> > > > > > > >     - `"sparse"`: Always use the sparse writer (only present
> > > > > > > >       columns).
> > > > > > > >     - `"adaptive"`: Dynamically compare both and pick the
> > > > > > > >       writer that uses less space.
> > > > > > > >
> > > > > > > > *Summary of Changes*
> > > > > > > >
> > > > > > > > - Interleaved layout per segment for columnIndex, offset, min,
> > > > > > > >   max.
> > > > > > > > - Logic to estimate the number of segments and assign columns
> > > > > > > >   to segments.
> > > > > > > > - Writer is selected dynamically using
> > > > > > > >   `PageZeroWriterFlavorSelector`.
> > > > > > > >
> > > > > > > > *Benefits*
> > > > > > > >
> > > > > > > > - Unlocks support for **tens of thousands of columns** per
> > > > > > > >   MegaPage.
> > > > > > > > - Better space efficiency for sparse batches.
> > > > > > > > - Retains backward compatibility: already-ingested MegaLeafs
> > > > > > > >   can still be read.
> > > > > > > >
> > > > > > > > This change is essential for evolving workloads that
> > > > > > > > increasingly rely on flexible schemas and sparse data layouts.
> > > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>
