I'll send some full results in a few days. I did a first pass and the improvements were pretty minor for the sparse column variant. After investigating a bit, I found that the benchmark still has a considerable number of inverted index fields, which drop back to row processing. When I switched all fields to doc values the gains were in the 15-20% range. This is still much smaller than what we are seeing (2-4X from sparse to dense). I suspect it is because a considerable amount of the luceneutil indexing benchmark's time is consumed by reading from the file, parsing date times, etc.
I'll investigate a bit more and share more specific results, along with the changes I made to the benchmark to surface the doc-value/points-oriented improvements. I'll also share some details on what I'm seeing in my macrobenchmarks, where the gains are larger.

--
Tim Brooks
[email protected]

On Thu, Apr 30, 2026, at 1:47 PM, Adrien Grand wrote:
> Very cool! I remember wanting something like this when I was looking into
> making Lucene a bit better at ingesting small structured documents like
> IndexGeoNames
> (https://github.com/mikemccand/luceneutil/blob/c530a720329bba774fefdadd17e027187845d100/src/extra/perf/IndexGeoNames.java).
> Is your POC complete enough to get a sense of the speedup that we'd get on
> this benchmark?
>
> On Tue, Apr 28, 2026 at 3:53 AM Tim Brooks <[email protected]> wrote:
>> Hi all,
>>
>> I'd like to propose adding a column-oriented document-ingestion API to
>> IndexWriter and get early feedback on the shape before opening a PR. I've
>> been prototyping this on a branch and would like to understand community
>> appetite before pushing further.
>>
>> https://github.com/apache/lucene/pull/15990
>>
>> ## The concept
>>
>> Today IndexWriter consumes an Iterable<IndexableField> per document: the
>> indexing chain walks each field, re-resolves FieldInfo / PerField state,
>> revalidates the field type against the schema, and interleaves
>> stored-fields, postings, doc-values and points per document.
>>
>> The proposal is to add a parallel intake path:
>> IndexWriter.addBatch(ColumnBatch). A ColumnBatch exposes a set of Columns,
>> where each Column represents one field across all documents in the batch.
>> The indexing chain then processes the batch in two passes:
>>
>> 1. A row-oriented pass for stored fields and the inverted index (per-doc
>> processing still matters there).
>> 2. A column-oriented pass for doc values, vectors, and points (where
>> per-field bulk writes are a natural fit).
>>
>> Column itself is just metadata (name, IndexableFieldType, density).
>> Iteration happens through typed cursors obtained from the subclasses:
>> LongColumn for numeric DV, 1-D points, and numeric stored; BinaryColumn for
>> binary/sorted DV, text/binary stored, and binary-encoded points; and
>> VectorColumn for KNN vectors. Each cursor call returns a fresh cursor, so a
>> column can be traversed once in the row pass and again in the column pass.
>>
>> ## Two benefits motivate this:
>>
>> 1. More compact in-memory representation during indexing. A column batch
>> avoids the per-field allocations of the document-at-a-time path
>> (IndexableField instances, per-doc FieldType references, per-doc attribute
>> maps). For numeric DV and points in particular, the caller can hand us a
>> primitive-backed cursor that the chain drains directly into PackedLongValues
>> / the points writer without indirection.
>> 2. Less redundant field validation. Field name, type, indexing options, and
>> schema compatibility are resolved once per column instead of once per
>> IndexableField. For workloads where a caller already knows the schema of a
>> batch, that revalidation is pure overhead.
>>
>> All in all, these changes drop CPU usage dedicated to
>> IndexWriter#addDocuments 4-5x for analytic heavy workloads.
>>
>> No changes to on-disk format; this is an ingestion-side API only.
>>
>> ## MVP: sparse columns
>>
>> The minimum useful version is sparse-only: every column is allowed to skip
>> doc-ids or have multiple values per doc-id, and the chain goes through the
>> same per-doc paths it uses today (just driven by a cursor instead of an
>> IndexableField stream). This is enough to land the API, the two-pass
>> consumer, and the public addBatch entry point without touching the
>> doc-values / points writers.
>>
>> ## Follow-on option: dense columns
>>
>> The bigger performance wins come from advertising a column as dense — every
>> doc in [0, numDocs) has exactly one value.
>> That lets the chain:
>>
>> - Skip the sparse-bitset bookkeeping in NumericDocValuesWriter /
>> SortedNumericDocValuesWriter entirely on the dense path.
>> - Bulk-fill straight into PackedLongValues from the column's values()
>> cursor, avoiding the per-value add loop.
>> - For 1-D numeric points, feed the BKD writer from the same dense primitive
>> cursor instead of one BytesRef at a time.
>> - For n-D numeric points, a fixed size binary column could feed multiple
>> document points in a single write. This is an expert scenario as users have
>> to serialize the points properly in sort order in the column.
>>
>> Density is asserted by the column up-front so the chain can pick the path
>> without probing.
>>
>> ## Follow-on option: Ergonomic builders
>>
>> I have focused on very low-level apis (abstract long and byte columns
>> implemented by users). Lucene could eventually add builders to create
>> columns easier (similar to IntField, LongField, etc).
>>
>> ## Indexed-only terms ("DOCS + no norms") as a column
>>
>> One more case worth flagging: fields indexed with IndexOptions.DOCS and no
>> norms — keyword/filter-style fields — don't need per-doc TokenStream
>> plumbing. A BinaryColumn over such a field can feed the postings writer
>> directly (one BytesRef per doc, no analysis, no norm accumulation). I have
>> not implemented this in my POC.
>>
>> ## Scope of the initial proposal
>>
>> - New package org.apache.lucene.document.column with ColumnBatch, Column,
>> LongColumn, BinaryColumn, and their cursors.
>> - New IndexWriter.addBatch(ColumnBatch) returning a seqno, plumbed through
>> DocumentsWriter / DocumentsWriterPerThread.
>> - Indexing-chain changes to support the two-pass consumer.
>> - All marked @lucene.experimental.
>> - Try to implement as much of the column oriented processing in the column
>> package to keep things experimental as long as possible.
>>
>> Would love feedback on if this is something Lucene is interested in or would
>> be open to.
>> It would help significantly in the analytical case and remove
>> significant indirection and memory usage amplification on the per-field
>> allocations.
>>
>> Thanks,
>> Tim
>>
>> --
>> Tim Brooks
>> [email protected]
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: [email protected]
>> For additional commands, e-mail: [email protected]
>
>
> --
> Adrien
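
For readers skimming the thread, the dense/sparse distinction from the proposal can be sketched in a few lines of Java. This is an illustration under assumptions, not the code in the PR: LongColumn, LongCursor, Buffer, column() and columnPass() are hypothetical stand-ins, the buffer is a plain long[] rather than PackedLongValues, and the sparse path assumes at most one value per doc to keep the sketch short.

```java
import java.util.BitSet;

/** Illustration only: every type here is hypothetical, not the API in the PR. */
public final class ColumnSketch {

  /** Forward-only view over one column's values. */
  interface LongCursor {
    boolean next(); // advance; returns false once exhausted
    int docId();    // doc the current value belongs to
    long value();   // current value
  }

  /** One field across all documents of a batch. */
  interface LongColumn {
    String name();
    boolean dense();     // asserted up-front: exactly one value per doc in [0, numDocs)
    LongCursor cursor(); // each call returns a fresh cursor
  }

  /** Stand-in for a doc-values buffer (a real writer would not use a plain array). */
  static final class Buffer {
    final long[] values;
    final BitSet docsWithValue; // null on the dense path: no bookkeeping needed
    final int count;

    Buffer(long[] values, BitSet docsWithValue, int count) {
      this.values = values;
      this.docsWithValue = docsWithValue;
      this.count = count;
    }
  }

  /** Builds an array-backed column; docIds[i] is the doc that owns values[i]. */
  static LongColumn column(String name, boolean dense, int[] docIds, long[] values) {
    return new LongColumn() {
      @Override public String name() { return name; }
      @Override public boolean dense() { return dense; }
      @Override public LongCursor cursor() {
        return new LongCursor() {
          private int i = -1;
          @Override public boolean next() { return ++i < values.length; }
          @Override public int docId() { return docIds[i]; }
          @Override public long value() { return values[i]; }
        };
      }
    };
  }

  /** Column pass: picks the path from the asserted density, without probing the data. */
  static Buffer columnPass(LongColumn column, int numDocs) {
    LongCursor cursor = column.cursor();
    long[] out = new long[numDocs];
    int n = 0;
    if (column.dense()) {
      // Dense: position in the buffer is the doc id, so no bitset is tracked.
      while (cursor.next()) {
        out[n++] = cursor.value();
      }
      return new Buffer(out, null, n);
    }
    // Sparse (single-valued here): record which docs actually carry a value.
    BitSet docsWithValue = new BitSet(numDocs);
    while (cursor.next()) {
      docsWithValue.set(cursor.docId());
      out[n++] = cursor.value();
    }
    return new Buffer(out, docsWithValue, n);
  }

  public static void main(String[] args) {
    LongColumn ts = column("timestamp", true, new int[] {0, 1, 2}, new long[] {10, 20, 30});
    LongColumn size = column("size", false, new int[] {0, 2}, new long[] {7, 9});
    Buffer denseBuf = columnPass(ts, 3);
    Buffer sparseBuf = columnPass(size, 3);
    System.out.println(ts.name() + ": " + denseBuf.count + " values, docsWithValue=" + denseBuf.docsWithValue);
    System.out.println(size.name() + ": " + sparseBuf.count + " values, docsWithValue=" + sparseBuf.docsWithValue);
  }
}
```

Because cursor() hands out a fresh cursor on each call, the same column could be walked once in a row pass and again in columnPass, which is the property the proposal relies on for its two-pass consumer.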
