Hi all, I'd like to propose adding a column-oriented document-ingestion API to IndexWriter and get early feedback on the shape before opening a PR. I've been prototyping this on a branch and would like to understand community appetite before pushing further.
https://github.com/apache/lucene/pull/15990

## The concept

Today IndexWriter consumes an Iterable<IndexableField> per document: the indexing chain walks each field, re-resolves FieldInfo / PerField state, revalidates the field type against the schema, and interleaves stored-fields, postings, doc-values and points per document.

The proposal is to add a parallel intake path: IndexWriter.addBatch(ColumnBatch). A ColumnBatch exposes a set of Columns, where each Column represents one field across all documents in the batch. The indexing chain then processes the batch in two passes:

1. A row-oriented pass for stored fields and the inverted index (per-doc processing still matters there).
2. A column-oriented pass for doc values, vectors, and points (where per-field bulk writes are a natural fit).

Column itself is just metadata (name, IndexableFieldType, density). Iteration happens through typed cursors obtained from the subclasses:

- LongColumn for numeric DV, 1-D points, and numeric stored;
- BinaryColumn for binary/sorted DV, text/binary stored, and binary-encoded points;
- VectorColumn for KNN vectors.

Each cursor call returns a fresh cursor, so a column can be traversed once in the row pass and again in the column pass.

## Two benefits motivate this:

1. More compact in-memory representation during indexing. A column batch avoids the per-field allocations of the document-at-a-time path (IndexableField instances, per-doc FieldType references, per-doc attribute maps). For numeric DV and points in particular, the caller can hand us a primitive-backed cursor that the chain drains directly into PackedLongValues / the points writer without indirection.
2. Less redundant field validation. Field name, type, indexing options, and schema compatibility are resolved once per column instead of once per IndexableField. For workloads where a caller already knows the schema of a batch, that revalidation is pure overhead.
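To make the column/cursor shape above a bit more concrete, here is a minimal, self-contained sketch. It is not the code from the PR: the names `LongCursor`, `ArrayLongColumn`, `ColumnSketch`, and the `nextDoc()` / `longValue()` methods are assumptions made up for illustration, and the `NO_MORE_DOCS` sentinel only mirrors the usual Lucene iterator convention. It shows a sparse, primitive-backed column that hands out a fresh cursor on every call, so the same column can be drained once per pass.

```java
/** Hypothetical cursor over a long-valued column (illustrative, not the proposed API). */
interface LongCursor {
  int NO_MORE_DOCS = Integer.MAX_VALUE;

  /** Advances to the next value and returns its doc id, or NO_MORE_DOCS when exhausted. */
  int nextDoc();

  /** Returns the value at the current position. */
  long longValue();
}

/** Hypothetical column: a field name plus a factory for independent cursors. */
interface LongColumn {
  String name();

  /** Each call returns a new cursor positioned before the first value. */
  LongCursor cursor();
}

/** Sparse column backed by parallel primitive arrays; a repeated doc id means multi-valued. */
final class ArrayLongColumn implements LongColumn {
  private final String name;
  private final int[] docs; // doc ids in non-decreasing order
  private final long[] values; // values[i] belongs to docs[i]

  ArrayLongColumn(String name, int[] docs, long[] values) {
    if (docs.length != values.length) {
      throw new IllegalArgumentException("docs and values must have the same length");
    }
    this.name = name;
    this.docs = docs;
    this.values = values;
  }

  @Override
  public String name() {
    return name;
  }

  @Override
  public LongCursor cursor() {
    return new LongCursor() {
      private int pos = -1;

      @Override
      public int nextDoc() {
        if (pos < docs.length) {
          pos++; // stop advancing once we are past the last entry
        }
        return pos < docs.length ? docs[pos] : NO_MORE_DOCS;
      }

      @Override
      public long longValue() {
        return values[pos];
      }
    };
  }
}

final class ColumnSketch {
  /** Stand-in for one consumer pass: drains a fresh cursor and sums every value it sees. */
  static long drainSum(LongColumn column) {
    LongCursor cursor = column.cursor();
    long sum = 0;
    while (cursor.nextDoc() != LongCursor.NO_MORE_DOCS) {
      sum += cursor.longValue();
    }
    return sum;
  }

  public static void main(String[] args) {
    // Doc 1 has no value and doc 2 has two values, so the column is sparse and multi-valued.
    LongColumn price =
        new ArrayLongColumn("price", new int[] {0, 2, 2}, new long[] {10L, 20L, 30L});
    System.out.println(drainSum(price)); // first traversal, e.g. the row pass
    System.out.println(drainSum(price)); // second traversal, e.g. the column pass
  }
}
```

Both calls in `main` print 60, because each pass gets its own cursor rather than sharing iteration state. No per-document field objects are allocated along the way; the only per-pass allocation is the cursor itself.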
Taken together, these two benefits cut the CPU time spent in IndexWriter#addDocuments by a factor of 4-5 for analytics-heavy workloads. There are no changes to the on-disk format; this is an ingestion-side API only.

## MVP: sparse columns

The minimum useful version is sparse-only: every column is allowed to skip doc-ids or have multiple values per doc-id, and the chain goes through the same per-doc paths it uses today (just driven by a cursor instead of an IndexableField stream). This is enough to land the API, the two-pass consumer, and the public addBatch entry point without touching the doc-values / points writers.

## Follow-on option: dense columns

The bigger performance wins come from advertising a column as dense, meaning every doc in [0, numDocs) has exactly one value. That lets the chain:

- Skip the sparse-bitset bookkeeping in NumericDocValuesWriter / SortedNumericDocValuesWriter entirely on the dense path.
- Bulk-fill straight into PackedLongValues from the column's values() cursor, avoiding the per-value add loop.
- For 1-D numeric points, feed the BKD writer from the same dense primitive cursor instead of one BytesRef at a time.
- For n-D numeric points, use a fixed-size binary column to feed multiple document points in a single write. This is an expert scenario, as users have to serialize the points properly, in sort order, in the column.

Density is asserted by the column up-front so the chain can pick the path without probing.

## Follow-on option: Ergonomic builders

I have focused on very low-level APIs (abstract long and byte columns implemented by users). Lucene could eventually add builders that make columns easier to create (similar to IntField, LongField, etc).

## Indexed-only terms ("DOCS + no norms") as a column

One more case worth flagging: fields indexed with IndexOptions.DOCS and no norms (keyword/filter-style fields) don't need per-doc TokenStream plumbing. A BinaryColumn over such a field can feed the postings writer directly (one BytesRef per doc, no analysis, no norm accumulation).
I have not implemented this in my POC.

## Scope of the initial proposal

- New package org.apache.lucene.document.column with ColumnBatch, Column, LongColumn, BinaryColumn, and their cursors.
- New IndexWriter.addBatch(ColumnBatch) returning a seqno, plumbed through DocumentsWriter / DocumentsWriterPerThread.
- Indexing-chain changes to support the two-pass consumer.
- All marked @lucene.experimental.
- Implement as much of the column-oriented processing as possible inside the column package, to keep things experimental for as long as possible.

I would love feedback on whether this is something Lucene is interested in or would be open to. It would help significantly in the analytical case, removing much of the indirection and memory amplification caused by per-field allocations.

Thanks,
Tim

--
Tim Brooks
