Re: [DISCUSS][APE] Supporting Unlimited Columns

Mike Carey Fri, 27 Jun 2025 09:53:56 -0700

So maybe its name is a little confusing?  But I guess it’s the same across
all pages?


On Fri, Jun 27, 2025 at 1:52 AM Ritik Raj <ri...@apache.org> wrote:

> Yes, I could have been clearer.
> Thanks for asking.
>
> On 2025/06/27 06:42:24 Taewoo Kim wrote:
> > Thanks for the clarification. So, essentially it means the column count in
> > the current segment. I somehow read "max column index" as "the last
> > column offset in the current segment".
> >
> > On Thu, Jun 26, 2025 at 11:37 PM Ritik Raj <ri...@apache.org> wrote:
> >
> > > The columnIndex is essentially an ID assigned to each column within an
> > > LSMIndex, typically ranging over [0, N-1] for N columns.
> > > Metadata for each column (such as offsets, min/max values, etc.) is stored
> > > sequentially based on this index.
> > > For example, if columnIndex = 0 has its offset at position X, then
> > > columnIndex = 1's offset would be at X + Integer.BYTES.
> > >
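
A minimal sketch of the addressing described above, assuming the offsets are stored back to back as 4-byte integers starting at some position X in PageZero. The class and method names are hypothetical and are not taken from the AsterixDB code base.

```java
import java.nio.ByteBuffer;

// Hypothetical helper for illustration only; not an AsterixDB class.
final class ColumnOffsetReader {
    private ColumnOffsetReader() {
    }

    /** Position of the offset slot for columnIndex, given the position X of column 0's slot. */
    static int offsetPosition(int offsetsStart, int columnIndex) {
        if (columnIndex < 0) {
            throw new IllegalArgumentException("columnIndex must be non-negative");
        }
        return offsetsStart + columnIndex * Integer.BYTES;
    }

    /** Absolute read of the 4-byte offset; does not move the buffer's position. */
    static int readOffset(ByteBuffer pageZero, int offsetsStart, int columnIndex) {
        return pageZero.getInt(offsetPosition(offsetsStart, columnIndex));
    }
}
```

Because every slot has the same width, a lookup is a single multiplication and one absolute read, with no scan over earlier columns.
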
> > > There are two primary reasons for having a Max Column Index:
> > > ## Not Penalizing Well-Modeled Schemas
> > >
> > > Many customers have a good data model with far fewer than 5000 columns.
> > > Setting a Max Column Index (e.g., 5000) ensures these users are not
> > > penalized with additional overhead from extra metadata segments.
> > > It allows efficient operation for typical use cases without incurring
> > > unnecessary costs.
> > >
> > > ## Simplified Segment Calculation for Column Metadata
> > >
> > > This boundary helps determine the segment in which a column's metadata
> > > resides.
> > >         For example:
> > >
> > >         Max columns in the first segment = 5000
> > >         Buffer cache page size = 32KB
> > >
> > >         If each column's metadata takes X bytes, then the number of
> > >         columns per page/segment is:
> > >         R = floor(32KB / X)
> > >         To find which segment column I resides in:
> > >
> > >         If I < 5000: it's in segment 0
> > >         If I ≥ 5000: it's in
> > >         segment = floor((I - 5000) / R) + 1
> > >
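
The segment calculation above can be sketched in a few lines. This is only an illustration under the example's numbers (5000 columns in the zeroth segment, 32KB pages); the class name, constants, and the choice of X = 20 bytes in `main` are assumptions, not the actual implementation.

```java
// Hypothetical illustration of the segment formula; not an AsterixDB class.
final class SegmentLocator {
    // Values copied from the example above, not read from any real configuration.
    static final int MAX_COLUMNS_IN_ZEROTH_SEGMENT = 5000;
    static final int PAGE_SIZE_BYTES = 32 * 1024;

    private SegmentLocator() {
    }

    /** R: number of column entries that fit in one page (integer division). */
    static int columnsPerSegment(int bytesPerColumn) {
        if (bytesPerColumn <= 0) {
            throw new IllegalArgumentException("bytesPerColumn must be positive");
        }
        return PAGE_SIZE_BYTES / bytesPerColumn;
    }

    /** Segment that holds the metadata of column I. */
    static int segmentOf(int columnIndex, int bytesPerColumn) {
        if (columnIndex < 0) {
            throw new IllegalArgumentException("columnIndex must be non-negative");
        }
        if (columnIndex < MAX_COLUMNS_IN_ZEROTH_SEGMENT) {
            return 0;
        }
        int r = columnsPerSegment(bytesPerColumn);
        return (columnIndex - MAX_COLUMNS_IN_ZEROTH_SEGMENT) / r + 1;
    }

    public static void main(String[] args) {
        int x = 20; // assumed bytes per column entry, so R = 32768 / 20 = 1638
        System.out.println(segmentOf(4999, x)); // 0
        System.out.println(segmentOf(5000, x)); // 1
        System.out.println(segmentOf(6638, x)); // 2
    }
}
```

Columns below the threshold never pay for the division, which matches the goal of not penalizing schemas that fit in the zeroth segment.
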
> > > On 2025/06/26 23:22:01 Taewoo Kim wrote:
> > > > +1
> > > >
> > > > Q: What's the main point of having the "Max Column Index"?
> > > >
> > > > Best,
> > > > Taewoo
> > > >
> > > >
> > > > On Thu, Jun 26, 2025 at 3:33 PM Ian Maxon <ima...@uci.edu> wrote:
> > > >
> > > > > +1
> > > > >
> > > > > On Thu, Jun 26, 2025 at 3:08 PM Mike Carey <dtab...@gmail.com> wrote:
> > > > > >
> > > > > > This looks excellent!
> > > > > >
> > > > > > +1 for adopting this extension to our storage system ASAP.
> > > > > >
> > > > > > On 6/26/25 11:53 AM, Ritik Raj wrote:
> > > > > > > *Expanding PageZero to Support Unlimited Columns*
> > > > > > > > APE:
> > > > > > > > https://urldefense.com/v3/__https://cwiki.apache.org/confluence/display/ASTERIXDB/APE*23*3A*Unlimited*Columns*Support__;KyUrKys!!CzAuKJ42GuquVTTmVmPViYEvSg!P1EHZSwq7hcpOlyHuy7R1F0lAkJK31elLGusrjb58xBVxuuNH4gxpVwKRuJSv9mByOtN5siVn5A6sQ$
> > > > > > >
> > > > > > > > In the columnar storage format, each MegaPage represents a logical leaf
> > > > > > > > node and begins with `PageZero`, a metadata section that captures essential
> > > > > > > > column metadata including column offsets and min/max filters. Originally,
> > > > > > > > `PageZero` was constrained to reside in a single page (typically 128KB),
> > > > > > > > with a fixed layout that stored information for **every column** in the
> > > > > > > > global schema.
> > > > > > >
> > > > > > > > Each column entry consumed 4 bytes for offset and 16 bytes for a min/max
> > > > > > > > filter, leading to a **metadata footprint of 20 bytes per column**. With
> > > > > > > > this layout, the **maximum number of columns supported was capped at
> > > > > > > > ~6,000**, given space constraints and the need to reserve part of
> > > > > > > > `PageZero` for primary key metadata and structural headers.
> > > > > > >
> > > > > > > > This limitation became problematic for datasets with **wide or sparse
> > > > > > > > schemas**, where many columns may be missing in individual document batches
> > > > > > > > but still occupy space in `PageZero`. The presence of unused metadata
> > > > > > > > bloated the footprint and limited scalability.
> > > > > > >
> > > > > > > *Multi-Segment PageZero: Motivation and Layout*
> > > > > > >
> > > > > > > > To overcome this limitation, we introduce **multi-segment support in
> > > > > > > > PageZero**. Instead of storing all metadata in a single fixed block, we
> > > > > > > > partition PageZero into multiple **segments**, with the **first (zeroth)
> > > > > > > > segment storing primary key metadata and as many column entries as it can
> > > > > > > > fit**, and subsequent segments storing the remaining metadata.
> > > > > > >
> > > > > > > > Each segment follows the same layout: column index → offset → min → max,
> > > > > > > > stored in an interleaved manner. This structure ensures efficient scan and
> > > > > > > > lookup, while enabling us to scale to **arbitrarily many columns**, bounded
> > > > > > > > only by MegaPage size.
> > > > > > >
> > > > > > > *Segment Layout:*
> > > > > > >
> > > > > > > ```
> > > > > > > [ Segment Header ]
> > > > > > >   ├─ Number of Columns
> > > > > > >   ├─ Max Column Index in Segment
> > > > > > > [ Interleaved Metadata Entries ]
> > > > > > >   ├─ ColumnIndex₁, Offset₁, Min₁, Max₁
> > > > > > >   ├─ ColumnIndex₂, Offset₂, Min₂, Max₂
> > > > > > >   └─ ...
> > > > > > > ```
> > > > > > >
> > > > > > > > A new `DefaultColumnMultiPageZeroWriter` class was introduced to manage
> > > > > > > > this segmented layout. It delegates metadata writing to individual segments
> > > > > > > > while maintaining headers at the top level for navigation.
> > > > > > >
> > > > > > > *Adaptive Writer Selection*
> > > > > > >
> > > > > > > > To avoid burdening all batches with this segmented structure, we retain the
> > > > > > > > `DefaultColumnPageZeroWriter` for small or dense schemas. A new **adaptive
> > > > > > > > selection mechanism** compares space usage of both writers for a batch and
> > > > > > > > picks the one that uses less space.
> > > > > > >
> > > > > > > > The decision logic weighs:
> > > > > > > > - Space taken by Default Multi-Segment writer (fixed layout for all columns)
> > > > > > > > - Space taken by Sparse Multi-Segment writer (compact layout for present
> > > > > > > >   columns)
> > > > > > >
> > > > > > > This logic is encapsulated in `PageZeroWriterFlavorSelector`.
> > > > > > >
> > > > > > > *New Configuration Options:*
> > > > > > >
> > > > > > > Two new storage configuration parameters have been introduced:
> > > > > > >
> > > > > > > > 1. **`STORAGE_MAX_COLUMNS_IN_ZEROTH_SEGMENT`** (`INTEGER_BYTE_UNIT`,
> > > > > > > > default: `5000`)
> > > > > > > >     Controls the maximum number of columns that can be stored in the zeroth
> > > > > > > > segment of `PageZero`. Remaining columns, if any, are offloaded to
> > > > > > > > additional segments. This helps balance lookup performance (fast for the
> > > > > > > > zeroth segment) and scalability. The default might change based on
> > > > > > > > performance experiments.
> > > > > > >
> > > > > > > > 2. **`STORAGE_PAGE_ZERO_WRITER`** (`STRING`, default: `"default"`)
> > > > > > > >     Controls the writer strategy used during flush. Accepted values are:
> > > > > > > >     - `"default"`: Always use the legacy writer.
> > > > > > > >     - `"sparse"`: Always use the sparse writer (only present columns).
> > > > > > > >     - `"adaptive"`: Dynamically compare both and pick the writer that uses
> > > > > > > >       less space.
> > > > > > >
> > > > > > > *Summary of Changes*
> > > > > > >
> > > > > > > > - Interleaved layout per segment for columnIndex, offset, min, max.
> > > > > > > > - Logic to estimate the number of segments and assign columns to segments.
> > > > > > > > - Writer is selected dynamically using `PageZeroWriterFlavorSelector`.
> > > > > > >
> > > > > > > *Benefits*
> > > > > > >
> > > > > > > > - Unlocks support for **tens of thousands of columns** per MegaPage.
> > > > > > > > - Better space efficiency for sparse batches.
> > > > > > > > - Retains backward compatibility: Already ingested MegaLeafs can also be
> > > > > > > >   read.
> > > > > > >
> > > > > > > > This change is essential for evolving workloads that increasingly rely on
> > > > > > > > flexible schemas and sparse data layouts.
> > > > > > >
> > > > >
> > > >
> > >
> >
>
