This looks excellent!

+1 for adopting this extension to our storage system ASAP.

On 6/26/25 11:53 AM, Ritik Raj wrote:
*Expanding PageZero to Support Unlimited Columns*
APE:
https://cwiki.apache.org/confluence/display/ASTERIXDB/APE+23%3A+Unlimited+Columns+Support

In the columnar storage format, each MegaPage represents a logical leaf
node and begins with `PageZero`, a metadata section that captures essential
column metadata including column offsets and min/max filters. Originally,
`PageZero` was constrained to reside in a single page (typically 128KB),
with a fixed layout that stored information for **every column** in the
global schema.

Each column entry consumed 4 bytes for offset and 16 bytes for a min/max
filter, leading to a **metadata footprint of 20 bytes per column**. With
this layout, the **maximum number of columns supported was capped at
~6,000**, given space constraints and the need to reserve part of
`PageZero` for primary key metadata and structural headers.
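
As a rough check on the ~6,000 figure, the arithmetic can be sketched as follows. The 4-byte offset, 16-byte filter, and 128KB page come from the text above; the class name, the `reservedBytes` parameter, and the 8KB reservation are hypothetical, since the exact space reserved for headers and primary key metadata is not stated here.

```java
// Back-of-the-envelope capacity of a single-page PageZero (illustrative only).
final class PageZeroCapacity {
    static final int OFFSET_BYTES = 4;   // per-column offset (from the text)
    static final int FILTER_BYTES = 16;  // per-column min/max filter (from the text)
    static final int ENTRY_BYTES = OFFSET_BYTES + FILTER_BYTES; // 20 bytes per column

    // reservedBytes is an assumed input: space set aside for structural
    // headers and primary key metadata.
    static int maxColumns(int pageSizeBytes, int reservedBytes) {
        if (reservedBytes < 0 || reservedBytes > pageSizeBytes) {
            throw new IllegalArgumentException("reservedBytes out of range");
        }
        return (pageSizeBytes - reservedBytes) / ENTRY_BYTES;
    }

    public static void main(String[] args) {
        int pageSize = 128 * 1024;
        // Upper bound if nothing were reserved.
        System.out.println(maxColumns(pageSize, 0));
        // With an assumed 8KB reservation.
        System.out.println(maxColumns(pageSize, 8 * 1024));
    }
}
```

With nothing reserved the bound is 131072 / 20, about 6,553 columns, so any realistic reservation lands near the ~6,000 cap quoted above.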

This limitation became problematic for datasets with **wide or sparse
schemas**, where many columns may be missing in individual document batches
but still occupy space in `PageZero`. The presence of unused metadata
bloated the footprint and limited scalability.

*Multi-Segment PageZero: Motivation and Layout*

To overcome this limitation, we introduce **multi-segment support in
PageZero**. Instead of storing all metadata in a single fixed block, we
partition PageZero into multiple **segments**, with the **first (zeroth)
segment storing primary key metadata and as many column entries as it can
fit**, and subsequent segments storing the remaining metadata.

Each segment follows the same layout: column index → offset → min → max,
stored in an interleaved manner. This structure ensures efficient scan and
lookup, while enabling us to scale to **arbitrarily many columns**, bounded
only by MegaPage size.

*Segment Layout:*

```
[ Segment Header ]
  ├─ Number of Columns
  ├─ Max Column Index in Segment
[ Interleaved Metadata Entries ]
  ├─ ColumnIndex₁, Offset₁, Min₁, Max₁
  ├─ ColumnIndex₂, Offset₂, Min₂, Max₂
  └─ ...
```
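
To make the layout concrete, here is a minimal sketch of encoding one segment and looking up a column in it. It is not the actual AsterixDB code: `SegmentSketch` and its methods are hypothetical, the field widths (4-byte header fields, 4-byte column index, 8-byte min and 8-byte max) are assumptions chosen to match the 4-byte offset and 16-byte filter mentioned earlier, and the binary search assumes entries are written in ascending column-index order.

```java
import java.nio.ByteBuffer;

// Illustrative encoding of a single PageZero segment (assumed field widths).
final class SegmentSketch {
    static final int HEADER_BYTES = 2 * Integer.BYTES;                 // numberOfColumns + maxColumnIndex
    static final int ENTRY_BYTES = 2 * Integer.BYTES + 2 * Long.BYTES; // index + offset + min + max

    // Writes the header followed by interleaved entries.
    // Assumption: columnIndexes is sorted in ascending order.
    static ByteBuffer write(int[] columnIndexes, int[] offsets, long[] mins, long[] maxs) {
        int n = columnIndexes.length;
        ByteBuffer buf = ByteBuffer.allocate(HEADER_BYTES + n * ENTRY_BYTES);
        buf.putInt(n);
        buf.putInt(n == 0 ? -1 : columnIndexes[n - 1]);
        for (int i = 0; i < n; i++) {
            buf.putInt(columnIndexes[i]).putInt(offsets[i]).putLong(mins[i]).putLong(maxs[i]);
        }
        buf.flip();
        return buf;
    }

    // Returns the stored offset for columnIndex, or -1 if the column is absent.
    static int findOffset(ByteBuffer segment, int columnIndex) {
        int n = segment.getInt(0);
        int maxIndex = segment.getInt(Integer.BYTES);
        if (columnIndex > maxIndex) {
            return -1; // header lets us skip the segment without scanning entries
        }
        int lo = 0;
        int hi = n - 1;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            int pos = HEADER_BYTES + mid * ENTRY_BYTES;
            int idx = segment.getInt(pos);
            if (idx == columnIndex) {
                return segment.getInt(pos + Integer.BYTES);
            }
            if (idx < columnIndex) {
                lo = mid + 1;
            } else {
                hi = mid - 1;
            }
        }
        return -1;
    }
}
```

The max column index in the header is what lets a reader reject a segment without touching its entries, which is one plausible reason for keeping it there.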

A new `DefaultColumnMultiPageZeroWriter` class was introduced to manage
this segmented layout. It delegates metadata writing to individual segments
while maintaining headers at the top level for navigation.

*Adaptive Writer Selection*

To avoid burdening all batches with this segmented structure, we retain the
`DefaultColumnPageZeroWriter` for small or dense schemas. A new **adaptive
selection mechanism** compares space usage of both writers for a batch and
picks the optimal one.

The decision logic weighs:
- Space taken by the default multi-segment writer (fixed layout for all
columns)
- Space taken by the sparse multi-segment writer (compact layout for present
columns only)

This logic is encapsulated in `PageZeroWriterFlavorSelector`.
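
The comparison can be sketched roughly as below. This is not the code in `PageZeroWriterFlavorSelector`: the class, enum, and method names are hypothetical, segment and page headers are ignored, and the sparse entry is assumed to cost an extra 4 bytes for the explicit column index.

```java
// Illustrative space comparison between the two writer flavors.
final class WriterSelectionSketch {
    enum Flavor { DEFAULT, SPARSE }

    static final int DEFAULT_ENTRY_BYTES = 4 + 16;    // offset + min/max filter
    static final int SPARSE_ENTRY_BYTES = 4 + 4 + 16; // assumed 4-byte column index + offset + filter

    // Default layout: one entry for every column in the global schema.
    static long defaultSize(int totalColumns) {
        return (long) totalColumns * DEFAULT_ENTRY_BYTES;
    }

    // Sparse layout: one entry only for columns present in this batch.
    static long sparseSize(int presentColumns) {
        return (long) presentColumns * SPARSE_ENTRY_BYTES;
    }

    // Picks the flavor with the smaller footprint; ties go to the default writer.
    static Flavor select(int totalColumns, int presentColumns) {
        return sparseSize(presentColumns) < defaultSize(totalColumns)
                ? Flavor.SPARSE
                : Flavor.DEFAULT;
    }
}
```

Under these assumed sizes the sparse layout wins whenever fewer than about five sixths of the schema's columns are present in the batch; the real break-even point depends on the actual header and entry sizes.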

*New Configuration Options:*

Two new storage configuration parameters have been introduced:

1. **`STORAGE_MAX_COLUMNS_IN_ZEROTH_SEGMENT`** (`INTEGER_BYTE_UNIT`,
default: `5000`)
    Controls the maximum number of columns that can be stored in the zeroth
segment of `PageZero`. Remaining columns, if any, are offloaded to
additional segments. This helps balance lookup performance (fast for the
zeroth segment) and scalability. The default may change based on
performance experiments.

2. **`STORAGE_PAGE_ZERO_WRITER`** (`STRING`, default: `"default"`)
    Controls the writer strategy used during flush. Accepted values are:
    - `"default"`: Always use the legacy writer.
    - `"sparse"`: Always use the sparse writer (only present columns).
    - `"adaptive"`: Dynamically compare both and pick the writer that uses
less space.
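
To illustrate how the zeroth-segment limit could drive segment estimation, here is a small sketch. The class and method names are hypothetical, and `perSegmentCap` (how many entries fit in each later segment) is an assumed input; the example value 5461 is simply 128KB divided by an assumed 24-byte entry.

```java
// Illustrative segment planning: how many segments are needed and where
// the column at a given position lands.
final class SegmentPlanSketch {
    // zerothCap corresponds to STORAGE_MAX_COLUMNS_IN_ZEROTH_SEGMENT.
    static int segmentCount(int columns, int zerothCap, int perSegmentCap) {
        if (columns < 0 || zerothCap <= 0 || perSegmentCap <= 0) {
            throw new IllegalArgumentException("invalid sizes");
        }
        if (columns <= zerothCap) {
            return 1; // everything fits in the zeroth segment
        }
        int remaining = columns - zerothCap;
        // Ceiling division for the overflow segments.
        return 1 + (remaining + perSegmentCap - 1) / perSegmentCap;
    }

    // position is the zero-based rank of the column among those being written.
    static int segmentOf(int position, int zerothCap, int perSegmentCap) {
        if (position < zerothCap) {
            return 0;
        }
        return 1 + (position - zerothCap) / perSegmentCap;
    }

    public static void main(String[] args) {
        int zerothCap = 5000;     // default from the configuration above
        int perSegmentCap = 5461; // assumed: 128KB / 24-byte entries
        System.out.println(segmentCount(20000, zerothCap, perSegmentCap));
        System.out.println(segmentOf(12000, zerothCap, perSegmentCap));
    }
}
```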

*Summary of Changes*

- Interleaved layout per segment for columnIndex, offset, min, max.
- Logic to estimate the number of segments and assign columns to segments.
- Writer is selected dynamically using `PageZeroWriterFlavorSelector`.

*Benefits*

- Unlocks support for **tens of thousands of columns** per MegaPage.
- Better space efficiency for sparse batches.
- Retains backward compatibility: already-ingested MegaLeafs can still be
read.

This change is essential for evolving workloads that increasingly rely on
flexible schemas and sparse data layouts.
