+1

This APE is super important since JSON lets users do "stupid" things like giving one of the pieces of information in a collection's objects a monotonically increasing name, e.g., using a timestamp as a key and an observation (e.g., temperature) as the associated value.  :-)  Such a collection ends up with a never-ending, ever-growing set of "columns" (name-wise) that are each used just once.  Ouch!

On 6/12/25 8:06 AM, Ian Maxon wrote:
+1, this APE is really cool and is a great solution to tricky
situations like objects with generated field names.

On Thu, Jun 12, 2025 at 2:26 AM Ritik Raj <ri...@apache.org> wrote:
During data ingestion or upsert operations, documents are flushed to disk
in batches, creating disk components. In the columnar storage format, each
MegaPage, which logically represents a leaf, begins with a single-page
metadata section called `PageZero`.

Currently, `PageZero` stores metadata for every column in the global
schema, even if a column is not present in the documents of the current
batch. This metadata includes a 4-byte offset and a 16-byte filter (min/max
values) per column. This approach leads to significant overhead, especially
for datasets with sparse or wide schemas. The 128KB default size limit of
`PageZero` imposes a practical maximum of approximately 6,500 columns,
which is further reduced in practice by the space required for primary keys.
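
For intuition, the ~6,500 figure follows directly from the sizes above. A
minimal Java sketch of the arithmetic (constants taken from the numbers
quoted here; this is not actual AsterixDB code):

    // Back-of-the-envelope capacity of the dense PageZero layout.
    public final class DensePageZeroBudget {
        public static void main(String[] args) {
            int pageZeroBytes = 128 * 1024; // 128KB default PageZero size
            int perColumnBytes = 4 + 16;    // 4-byte offset + 16-byte filter
            // Prints 6553; primary keys and headers reduce this further.
            System.out.println(pageZeroBytes / perColumnBytes);
        }
    }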

The proposed enhancement introduces an efficient "Sparse PageZero writer".
This writer is designed to store metadata only for the subset of columns
that are actually present in the current batch of documents being flushed,
plus any other columns required for correct record assembly (e.g., in
union types or nested structures). This reduces metadata overhead,
enabling support for schemas with a larger number of sparse columns within
the existing `PageZero` size constraint.
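
As a sketch of what a sparse entry could carry, per the proposal's
`columnIndex`/`columnOffset`/`columnFilter` triple (the serialization
details and names here are assumptions, not the actual implementation):

    import java.nio.ByteBuffer;
    import java.util.List;

    // Hypothetical sparse PageZero entry: only columns present in the
    // batch (plus those needed for assembly) get an entry, keyed by
    // their index in the global schema.
    record SparseColumnEntry(int columnIndex, int columnOffset, byte[] filter) {

        // Entries are assumed to be written sorted by columnIndex so
        // that readers can locate a column by binary search.
        static void writeAll(ByteBuffer pageZero, List<SparseColumnEntry> sorted) {
            for (SparseColumnEntry e : sorted) {
                pageZero.putInt(e.columnIndex());  // 4 bytes
                pageZero.putInt(e.columnOffset()); // 4 bytes
                pageZero.put(e.filter());          // 16-byte min/max filter
            }
        }
    }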

Risks and trade-offs include a potential performance impact. The sparse
format requires PageReaders to perform a binary search to look up column
offsets and filters, rather than a direct index lookup, which introduces
CPU overhead. There is also a minor computational overhead from the column
estimation logic.
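
To illustrate the lookup cost, a reader over that layout would do roughly
the following (a sketch assuming fixed-width entries, sorted by column
index, starting at buffer offset 0; class and method names are
hypothetical):

    import java.nio.ByteBuffer;

    // Hypothetical reader-side lookup over fixed-width sparse entries.
    final class SparsePageZeroReader {
        static final int ENTRY_BYTES = 4 + 4 + 16; // index + offset + filter

        // Returns the byte position of the entry for columnIndex, or -1
        // if the column has no entry in this MegaPage: O(log n) per
        // lookup vs. O(1) in the dense layout, which is the CPU
        // overhead mentioned above.
        static int findEntry(ByteBuffer pageZero, int entryCount, int columnIndex) {
            int lo = 0, hi = entryCount - 1;
            while (lo <= hi) {
                int mid = (lo + hi) >>> 1;
                int midIndex = pageZero.getInt(mid * ENTRY_BYTES);
                if (midIndex < columnIndex) {
                    lo = mid + 1;
                } else if (midIndex > columnIndex) {
                    hi = mid - 1;
                } else {
                    return mid * ENTRY_BYTES;
                }
            }
            return -1;
        }
    }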

An alternative is to keep the existing "Default" writer. The proposal also
includes an "Adaptive" mode that dynamically evaluates both the Default
and Sparse writers for an incoming batch and selects whichever consumes
less space.
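
To make that choice concrete, here is a minimal sketch of the adaptive
decision, assuming a 4-byte column index per sparse entry and the
per-column sizes quoted above (names are hypothetical):

    // Hypothetical adaptive check: the dense layout pays 20 bytes for
    // every schema column; the sparse layout pays 24 bytes (extra
    // 4-byte column index) but only for the columns it stores.
    final class AdaptivePageZeroChooser {
        static boolean preferSparse(int presentColumns, int totalColumns) {
            long denseSize = (long) totalColumns * (4 + 16);        // offset + filter
            long sparseSize = (long) presentColumns * (4 + 4 + 16); // index + offset + filter
            return sparseSize < denseSize;
        }
    }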

A limitation of this proposal is that `PageZero` remains constrained to a
single page (128KB by default), so the hard limit on the number of columns
in a single MegaPage remains ~6,500 by default. This change does not
remove that limit.

This APE [
https://cwiki.apache.org/confluence/pages/viewpage.action?spaceKey=ASTERIXDB&title=APE+22%3A+Sparse+column+metadata+storage
] would introduce a new "Sparse PageZero writer" that writes metadata for
only the subset of columns present in a given batch.

The source code changes are summarized as follows:
*   A new `PageZero Writer Mode` configuration option will be added with
three possible values (sketched in code after this list):
     *   "*Default*": Always uses the current writer.
     *   "*Sparse*": Always uses the new sparse writer.
     *   "*Adaptive*": Dynamically compares the space usage of both writers
for an incoming batch and selects the one that results in a smaller
`PageZero`.
*   The sparse layout will store `columnIndex`, `columnOffset`, and
`columnFilter` for each present column.
*   Logic will be added to determine the minimum required set of columns
for a batch, accounting for schema evolution, unions, and nested structures
to ensure correct record assembly.
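
For illustration, the three modes could be modeled as a simple enum (the
actual configuration key and class names are up to the implementation):

    // Hypothetical modeling of the proposed "PageZero Writer Mode" option.
    enum PageZeroWriterMode {
        DEFAULT,  // always the current (dense) writer
        SPARSE,   // always the new sparse writer
        ADAPTIVE; // per batch, pick whichever writer yields a smaller PageZero

        static PageZeroWriterMode fromConfig(String value) {
            return valueOf(value.trim().toUpperCase());
        }
    }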

The change is controlled by a new configuration option. Existing disk
components created with the default writer will coexist with new
components. Since the global metadata is maintained at the index level and
used by the column assembler to reconstruct records, the system will be
able to read from components created with either writer, ensuring backward
compatibility.
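
As an illustration only (the proposal does not spell out the dispatch
mechanism), the read path could branch on which writer produced a
component, reusing `findEntry` from the reader sketch above:

    import java.nio.ByteBuffer;

    // Hypothetical read-path dispatch; in either case the index-level
    // global schema lets the column assembler reconstruct records.
    final class ColumnOffsetLookup {
        static int columnOffset(ByteBuffer pageZero, boolean sparseLayout,
                                int entryCount, int columnIndex) {
            if (!sparseLayout) {
                // Dense layout: direct index into the per-column offset array.
                return pageZero.getInt(columnIndex * 4);
            }
            int pos = SparsePageZeroReader.findEntry(pageZero, entryCount, columnIndex);
            return pos < 0 ? -1 : pageZero.getInt(pos + 4); // offset follows the index
        }
    }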

The following areas will be tested to validate the change:

*Performance Testing*:
Once a prototype is available, performance testing should be done to
evaluate the trade-offs:
1.  *Indirect Column Lookup*: Measure the CPU overhead introduced by using
binary search to locate column offsets and filters.
2.  *Column Estimation Overhead*: Measure the computational cost of the
column estimation step.

*Functional Testing*:
1.  *Default Writer Validation*: Run the existing test suite with documents
containing most or all fields to ensure the default writer's behavior is
unchanged.
2.  *Sparse Writer Validation*: Design a new test suite with batches of
sparse documents (where each batch contains a subset of fields) to verify
that the `SparsePageZeroWriter` produces smaller disk components. Tests
will be constructed with column sets at or below the ~6,500-column limit.
3.  *Correctness Checks*: For both writers, compare query results against
equivalent row-format collections to ensure correctness, paying special
attention to missing fields, null values, and nested structures (arrays,
objects, unions).
