Hi all,

I'd like to propose adding a column-oriented document-ingestion API to 
IndexWriter and get early feedback on the shape before opening a PR. I've been 
prototyping this on a branch and would like to understand community appetite 
before pushing further.

https://github.com/apache/lucene/pull/15990

## The concept

Today IndexWriter consumes an Iterable<IndexableField> per document: the 
indexing chain walks each field, re-resolves FieldInfo / PerField state, 
revalidates the field type against the schema, and interleaves stored-fields, 
postings, doc-values and points per document.

The proposal is to add a parallel intake path: 
IndexWriter.addBatch(ColumnBatch). A ColumnBatch exposes a set of Columns, 
where each Column represents one field across all documents in the batch. The 
indexing chain then processes the batch in two passes:

1. A row-oriented pass for stored fields and the inverted index (per-doc 
processing still matters there).
2. A column-oriented pass for doc values, vectors, and points (where per-field 
bulk writes are a natural fit).

Column itself is just metadata (name, IndexableFieldType, density). Iteration 
happens through typed cursors obtained from the subclasses: LongColumn for 
numeric DV, 1-D points, and numeric stored; BinaryColumn for binary/sorted DV, 
text/binary stored, and binary-encoded points; and VectorColumn for KNN 
vectors. Each cursor call returns a fresh cursor, so a column can be traversed 
once in the row pass and again in the column pass.
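
To make the cursor contract concrete, here is a minimal, self-contained sketch 
of the idea. Every name in it (LongCursor, LongColumnSketch, ArrayLongColumn, 
TwoPassSketch) is hypothetical and chosen for illustration; it is not the API 
on the branch and it does not touch any Lucene classes. It only shows a column 
handing out fresh cursors, so the same data can be walked once per document in 
a row pass and then drained column by column.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical cursor over (docId, value) entries in ascending doc order. */
interface LongCursor {
  int NO_MORE_DOCS = Integer.MAX_VALUE;

  /** Moves to the next entry; returns its doc id, or NO_MORE_DOCS when exhausted. */
  int nextDoc();

  /** Value of the current entry; only valid after nextDoc() returned a real doc id. */
  long longValue();
}

/** Hypothetical column: one field across every document in the batch. */
abstract class LongColumnSketch {
  final String name;

  LongColumnSketch(String name) {
    this.name = name;
  }

  /** Every call returns a fresh cursor positioned before the first entry. */
  abstract LongCursor cursor();
}

/** Array-backed column; doc ids may skip documents or repeat (multi-valued). */
final class ArrayLongColumn extends LongColumnSketch {
  private final int[] docIds;
  private final long[] values;

  ArrayLongColumn(String name, int[] docIds, long[] values) {
    super(name);
    this.docIds = docIds;
    this.values = values;
  }

  @Override
  LongCursor cursor() {
    return new LongCursor() {
      private int pos = -1;

      @Override
      public int nextDoc() {
        pos++;
        return pos < docIds.length ? docIds[pos] : NO_MORE_DOCS;
      }

      @Override
      public long longValue() {
        return values[pos];
      }
    };
  }
}

/** Illustrative two-pass consumer that records the order in which values are seen. */
final class TwoPassSketch {
  static List<String> consume(List<LongColumnSketch> columns, int numDocs) {
    List<String> log = new ArrayList<>();
    // Pass 1 (row-oriented): advance all cursors in lockstep, one document at a time.
    LongCursor[] cursors = new LongCursor[columns.size()];
    int[] current = new int[columns.size()];
    for (int i = 0; i < cursors.length; i++) {
      cursors[i] = columns.get(i).cursor();
      current[i] = cursors[i].nextDoc();
    }
    for (int doc = 0; doc < numDocs; doc++) {
      for (int i = 0; i < cursors.length; i++) {
        while (current[i] == doc) {
          log.add("row doc=" + doc + " " + columns.get(i).name + "=" + cursors[i].longValue());
          current[i] = cursors[i].nextDoc();
        }
      }
    }
    // Pass 2 (column-oriented): a fresh cursor per column, drained in one go.
    for (LongColumnSketch column : columns) {
      LongCursor c = column.cursor();
      for (int doc = c.nextDoc(); doc != LongCursor.NO_MORE_DOCS; doc = c.nextDoc()) {
        log.add("col " + column.name + " doc=" + doc + " value=" + c.longValue());
      }
    }
    return log;
  }
}
```

In the sketch the row pass stands in for stored fields and the inverted index, 
and the column pass stands in for doc values and points; the real chain would 
route each column to the appropriate writer rather than to a log.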

## Two benefits motivate this:

1. More compact in-memory representation during indexing. A column batch avoids 
the per-field allocations of the document-at-a-time path (IndexableField 
instances, per-doc FieldType references, per-doc attribute maps). For numeric 
DV and points in particular, the caller can hand us a primitive-backed cursor 
that the chain drains directly into PackedLongValues / the points writer 
without indirection.
2. Less redundant field validation. Field name, type, indexing options, and 
schema compatibility are resolved once per column instead of once per 
IndexableField. For workloads where a caller already knows the schema of a 
batch, that revalidation is pure overhead.

Taken together, these changes cut the CPU time spent in 
IndexWriter#addDocuments by 4-5x for analytics-heavy workloads.

No changes to on-disk format; this is an ingestion-side API only.

## MVP: sparse columns

The minimum useful version is sparse-only: every column is allowed to skip 
doc-ids or have multiple values per doc-id, and the chain goes through the same 
per-doc paths it uses today (just driven by a cursor instead of an 
IndexableField stream). This is enough to land the API, the two-pass consumer, 
and the public addBatch entry point without touching the doc-values / points 
writers.

## Follow-on option: dense columns

The bigger performance wins come from advertising a column as dense — every doc 
in [0, numDocs) has exactly one value. That lets the chain:

- Skip the sparse-bitset bookkeeping in NumericDocValuesWriter / 
SortedNumericDocValuesWriter entirely on the dense path.
- Bulk-fill straight into PackedLongValues from the column's values() cursor, 
avoiding the per-value add loop. 
- For 1-D numeric points, feed the BKD writer from the same dense primitive 
cursor instead of one BytesRef at a time.
- For n-D numeric points, a fixed-size binary column could feed multiple 
document points in a single write. This is an expert scenario, since users must 
serialize the points properly, in sort order, within the column.

Density is asserted by the column up-front so the chain can pick the path 
without probing.
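
As a rough illustration of that path selection, the sketch below uses a 
hypothetical Density flag and a plain long[] in place of PackedLongValues, with 
java.util.BitSet standing in for the docs-with-field bookkeeping. None of these 
types (Density, NumericColumnSketch, NumericBufferSketch) exist in Lucene or 
are taken from the branch; they only show how an up-front density declaration 
lets the consumer choose a bulk copy over a per-value loop.

```java
import java.util.BitSet;

/** Hypothetical density flag declared by the column before any value is read. */
enum Density { SPARSE, DENSE }

/** Hypothetical numeric column backed by primitive arrays. */
final class NumericColumnSketch {
  final Density density;
  final int[] docIds; // null when DENSE: values[i] belongs to doc i
  final long[] values;

  private NumericColumnSketch(Density density, int[] docIds, long[] values) {
    this.density = density;
    this.docIds = docIds;
    this.values = values;
  }

  static NumericColumnSketch dense(long[] values) {
    return new NumericColumnSketch(Density.DENSE, null, values);
  }

  static NumericColumnSketch sparse(int[] docIds, long[] values) {
    return new NumericColumnSketch(Density.SPARSE, docIds, values);
  }
}

/** Stand-in for a doc-values buffer; a plain long[] replaces PackedLongValues here. */
final class NumericBufferSketch {
  final long[] values;
  final BitSet docsWithField; // null on the dense path: every doc has a value

  private NumericBufferSketch(long[] values, BitSet docsWithField) {
    this.values = values;
    this.docsWithField = docsWithField;
  }

  static NumericBufferSketch fill(NumericColumnSketch column, int numDocs) {
    if (column.density == Density.DENSE) {
      if (column.values.length != numDocs) {
        throw new IllegalArgumentException("dense column must have exactly one value per doc");
      }
      // Dense path: one bulk copy, no per-value loop and no bitset bookkeeping.
      return new NumericBufferSketch(column.values.clone(), null);
    }
    // Sparse path: one add per value, tracking which docs have at least one value.
    BitSet docs = new BitSet(numDocs);
    long[] out = new long[column.values.length];
    for (int i = 0; i < column.values.length; i++) {
      docs.set(column.docIds[i]);
      out[i] = column.values[i];
    }
    return new NumericBufferSketch(out, docs);
  }
}
```

The length check on the dense path is the only validation the sketch performs; 
how strictly a real implementation should verify a column's density claim is an 
open design question.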

## Follow-on option: ergonomic builders

I have focused on very low-level APIs (abstract long and byte columns 
implemented by users). Lucene could eventually add builders that make columns 
easier to create (similar to IntField, LongField, etc).

## Indexed-only terms ("DOCS + no norms") as a column

One more case worth flagging: fields indexed with IndexOptions.DOCS and no 
norms — keyword/filter-style fields — don't need per-doc TokenStream plumbing. 
A BinaryColumn over such a field can feed the postings writer directly (one 
BytesRef per doc, no analysis, no norm accumulation). I have not implemented 
this in my POC.
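
Since this part is not implemented, the following is only a toy model of the 
idea, not a design. It assumes a hypothetical layout in which values[doc] holds 
the single untokenized term for each document (or null when the document has no 
value), and it uses a TreeMap of doc-id lists as a stand-in for the postings 
writer. Terms are decoded to String for readability, so the map is ordered by 
String comparison rather than by the byte order Lucene uses for terms.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Toy model: invert one keyword column into term -> doc ids, with no analysis or norms. */
final class KeywordPostingsSketch {
  /**
   * @param values hypothetical column layout: values[doc] is the UTF-8 term for that
   *     doc, or null when the doc has no value for the field
   */
  static Map<String, List<Integer>> invert(byte[][] values) {
    Map<String, List<Integer>> postings = new TreeMap<>();
    for (int doc = 0; doc < values.length; doc++) {
      if (values[doc] == null) {
        continue; // doc has no term for this field
      }
      // The whole value is the term: no TokenStream, no positions, no norm accumulation.
      String term = new String(values[doc], StandardCharsets.UTF_8);
      postings.computeIfAbsent(term, t -> new ArrayList<>()).add(doc);
    }
    return postings;
  }
}
```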

## Scope of the initial proposal

- New package org.apache.lucene.document.column with ColumnBatch, Column, 
LongColumn, BinaryColumn, and their cursors.
- New IndexWriter.addBatch(ColumnBatch) returning a seqno, plumbed through 
DocumentsWriter / DocumentsWriterPerThread.
- Indexing-chain changes to support the two-pass consumer.
- All marked @lucene.experimental.
- Implement as much of the column-oriented processing as possible inside the 
column package, so that it can stay experimental for as long as possible.

I would love feedback on whether this is something Lucene is interested in or 
would be open to. It would help significantly in the analytical case by 
removing much of the indirection and memory amplification caused by per-field 
allocations.

Thanks,
Tim

--
  Tim Brooks
  [email protected]
