Re: [DISCUSS] Column-oriented indexing API for IndexWriter

Tim Brooks Sat, 02 May 2026 19:18:21 -0700

I'll send some full results in a few days. 

I did a first pass and the improvements were pretty minor for the sparse column 
variant. After investigating a bit, I found that the benchmark still has a 
considerable number of inverted index fields, which drop back to row 
processing. When I switched all fields to docvalues the gains were in the 
15-20% range. This is still much smaller than what we are seeing (2-4X from 
sparse to dense). I suspect it is because a considerable amount of the time in 
the luceneutil indexing benchmarks is consumed by reading from the file, 
parsing date times, etc.


I'll investigate a bit more and share more specific results, along with the 
changes I made to the benchmark to surface the docvalue/points oriented 
improvements. I'll also share some details on what I'm seeing in my 
macrobenchmarks, where the gains are larger.

--
  Tim Brooks
  [email protected]

On Thu, Apr 30, 2026, at 1:47 PM, Adrien Grand wrote:
> Very cool! I remember wanting something like this when I was looking into 
> making Lucene a bit better at ingesting small structured documents like 
> IndexGeoNames 
> (https://github.com/mikemccand/luceneutil/blob/c530a720329bba774fefdadd17e027187845d100/src/extra/perf/IndexGeoNames.java).
>  Is your POC complete enough to get a sense of the speedup that we'd get on 
> this benchmark?
> 
> On Tue, Apr 28, 2026 at 3:53 AM Tim Brooks <[email protected]> wrote:
>> Hi all,
>> 
>> I'd like to propose adding a column-oriented document-ingestion API to 
>> IndexWriter and get early feedback on the shape before opening a PR. I've 
>> been prototyping this on a branch and would like to understand community 
>> appetite before pushing further.
>> 
>> https://github.com/apache/lucene/pull/15990
>> 
>> ## The concept
>> 
>> Today IndexWriter consumes an Iterable<IndexableField> per document: the 
>> indexing chain walks each field, re-resolves FieldInfo / PerField state, 
>> revalidates the field type against the schema, and interleaves 
>> stored-fields, postings, doc-values and points per document.
>> 
>> The proposal is to add a parallel intake path: 
>> IndexWriter.addBatch(ColumnBatch). A ColumnBatch exposes a set of Columns, 
>> where each Column represents one field across all documents in the batch. 
>> The indexing chain then processes the batch in two passes:
>> 
>> 1. A row-oriented pass for stored fields and the inverted index (per-doc 
>> processing still matters there).
>> 2. A column-oriented pass for doc values, vectors, and points (where 
>> per-field bulk writes are a natural fit).
>> 
>> Column itself is just metadata (name, IndexableFieldType, density). 
>> Iteration happens through typed cursors obtained from the subclasses: 
>> LongColumn for numeric DV, 1-D points, and numeric stored; BinaryColumn for 
>> binary/sorted DV, text/binary stored, and binary-encoded points; and 
>> VectorColumn for KNN vectors. Each cursor call returns a fresh cursor, so a 
>> column can be traversed once in the row pass and again in the column pass.
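
To make the cursor contract above concrete, here is a minimal sketch in plain 
Java. It is not the code from the prototype branch: LongCursor, 
LongColumnSketch, ArrayLongColumn and TwoPassDemo are hypothetical names, and 
the real signatures may differ. It only shows the idea that each cursor() call 
returns a fresh cursor, so the same column can be drained once for the row pass 
and again for the column pass.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical shapes for illustration only; not the types in the prototype.
interface LongCursor {
  /** Advances to the next (docId, value) entry; returns false when exhausted. */
  boolean next();

  int docId();

  long value();
}

abstract class LongColumnSketch {
  final String name;

  LongColumnSketch(String name) {
    this.name = name;
  }

  /** Each call returns a fresh cursor positioned before the first entry. */
  abstract LongCursor cursor();
}

/** Array-backed column; doc ids are sorted and may skip or repeat (sparse, multi-valued). */
final class ArrayLongColumn extends LongColumnSketch {
  private final int[] docIds;
  private final long[] values;

  ArrayLongColumn(String name, int[] docIds, long[] values) {
    super(name);
    if (docIds.length != values.length) {
      throw new IllegalArgumentException("docIds and values must have the same length");
    }
    this.docIds = docIds;
    this.values = values;
  }

  @Override
  LongCursor cursor() {
    return new LongCursor() {
      private int pos = -1;

      @Override
      public boolean next() {
        return ++pos < docIds.length;
      }

      @Override
      public int docId() {
        return docIds[pos];
      }

      @Override
      public long value() {
        return values[pos];
      }
    };
  }
}

final class TwoPassDemo {
  /** Drains a fresh cursor and returns its entries as "docId=value" strings. */
  static List<String> drain(LongColumnSketch column) {
    List<String> out = new ArrayList<>();
    LongCursor cursor = column.cursor();
    while (cursor.next()) {
      out.add(cursor.docId() + "=" + cursor.value());
    }
    return out;
  }
}
```

Calling TwoPassDemo.drain twice on the same column yields the same entries both 
times, which is the property the two-pass consumer relies on.
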
>> 
>> ## Two benefits motivate this:
>> 
>> 1. More compact in-memory representation during indexing. A column batch 
>> avoids the per-field allocations of the document-at-a-time path 
>> (IndexableField instances, per-doc FieldType references, per-doc attribute 
>> maps). For numeric DV and points in particular, the caller can hand us a 
>> primitive-backed cursor that the chain drains directly into PackedLongValues 
>> / the points writer without indirection.
>> 2. Less redundant field validation. Field name, type, indexing options, and 
>> schema compatibility are resolved once per column instead of once per 
>> IndexableField. For workloads where a caller already knows the schema of a 
>> batch, that revalidation is pure overhead.
>> 
>> All in all, these changes reduce the CPU time spent in 
>> IndexWriter#addDocuments by 4-5x for analytics-heavy workloads.
>> 
>> No changes to on-disk format; this is an ingestion-side API only.
>> 
>> ## MVP: sparse columns
>> 
>> The minimum useful version is sparse-only: every column is allowed to skip 
>> doc-ids or have multiple values per doc-id, and the chain goes through the 
>> same per-doc paths it uses today (just driven by a cursor instead of an 
>> IndexableField stream). This is enough to land the API, the two-pass 
>> consumer, and the public addBatch entry point without touching the 
>> doc-values / points writers.
>> 
>> ## Follow-on option: dense columns
>> 
>> The bigger performance wins come from advertising a column as dense — every 
>> doc in [0, numDocs) has exactly one value. That lets the chain:
>> 
>> - Skip the sparse-bitset bookkeeping in NumericDocValuesWriter / 
>> SortedNumericDocValuesWriter entirely on the dense path.
>> - Bulk-fill straight into PackedLongValues from the column's values() 
>> cursor, avoiding the per-value add loop. 
>> - For 1-D numeric points, feed the BKD writer from the same dense primitive 
>> cursor instead of one BytesRef at a time.
>> - For n-D numeric points, a fixed size binary column could feed multiple 
>> document points in a single write. This is an expert scenario as users have 
>> to serialize the points properly in sort order in the column.
>> 
>> Density is asserted by the column up-front so the chain can pick the path 
>> without probing.
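
The difference between the sparse and dense paths can be sketched as follows. 
This is a simplified stand-in, not Lucene's NumericDocValuesWriter: 
NumericBufferSketch is a hypothetical class, a plain long[] takes the place of 
PackedLongValues, and java.util.BitSet takes the place of the docs-with-field 
bookkeeping. The density flag is passed in by the caller, mirroring the idea 
that the column asserts density up-front.

```java
import java.util.Arrays;
import java.util.BitSet;

// Hypothetical stand-in for a numeric doc-values buffer; not a Lucene class.
final class NumericBufferSketch {
  private long[] values = new long[0];
  private int size = 0;
  private BitSet docsWithField = null; // only allocated on the sparse path

  /** Picks the path from the declared density, without inspecting the data. */
  void add(boolean dense, int[] docIds, long[] vals) {
    if (dense) {
      addDense(vals);
    } else {
      addSparse(docIds, vals);
    }
  }

  /** Sparse path: one value at a time, recording which docs have a value. */
  private void addSparse(int[] docIds, long[] vals) {
    if (docsWithField == null) {
      docsWithField = new BitSet();
    }
    for (int i = 0; i < docIds.length; i++) {
      docsWithField.set(docIds[i]);
      append(vals[i]);
    }
  }

  /** Dense path: every doc has exactly one value, so copy the block in bulk. */
  private void addDense(long[] vals) {
    if (values.length < size + vals.length) {
      values = Arrays.copyOf(values, size + vals.length);
    }
    System.arraycopy(vals, 0, values, size, vals.length);
    size += vals.length;
  }

  private void append(long v) {
    if (size == values.length) {
      values = Arrays.copyOf(values, Math.max(4, size * 2));
    }
    values[size++] = v;
  }

  /** Returns a copy of the buffered values in insertion order. */
  long[] values() {
    return Arrays.copyOf(values, size);
  }

  /** True only if the sparse path allocated per-doc bookkeeping. */
  boolean tracksDocs() {
    return docsWithField != null;
  }

  /** Number of distinct docs recorded on the sparse path (0 if never used). */
  int docsWithValue() {
    return docsWithField == null ? 0 : docsWithField.cardinality();
  }
}
```

On the dense path the buffer never allocates the bitset and never runs the 
per-value loop, which is where the savings described above would come from in 
this simplified model.
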
>> 
>> ## Follow-on option: Ergonomic builders
>> 
>> I have focused on very low-level APIs (abstract long and byte columns 
>> implemented by users). Lucene could eventually add builders to make 
>> columns easier to create (similar to IntField, LongField, etc).
>> 
>> ## Indexed-only terms ("DOCS + no norms") as a column
>> 
>> One more case worth flagging: fields indexed with IndexOptions.DOCS and no 
>> norms — keyword/filter-style fields — don't need per-doc TokenStream 
>> plumbing. A BinaryColumn over such a field can feed the postings writer 
>> directly (one BytesRef per doc, no analysis, no norm accumulation). I have 
>> not implemented this in my POC.
>> 
>> ## Scope of the initial proposal
>> 
>> - New package org.apache.lucene.document.column with ColumnBatch, Column, 
>> LongColumn, BinaryColumn, and their cursors.
>> - New IndexWriter.addBatch(ColumnBatch) returning a seqno, plumbed through 
>> DocumentsWriter / DocumentsWriterPerThread.
>> - Indexing-chain changes to support the two-pass consumer.
>> - All marked @lucene.experimental.
>> - Try to implement as much of the column-oriented processing as possible in 
>> the column package to keep things experimental as long as possible.
>> 
>> Would love feedback on whether this is something Lucene is interested in or 
>> would be open to. It would help significantly in the analytical case and 
>> remove significant indirection and memory amplification from the per-field 
>> allocations.
>> 
>> Thanks,
>> Tim
>> 
>> --
>>   Tim Brooks
>>   [email protected]
>> 
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: [email protected]
>> For additional commands, e-mail: [email protected]
> 
> 
> --
> Adrien
