On Tue, Apr 9, 2013 at 8:22 AM, Wei Wang <welshw...@gmail.com> wrote:
> DocValues makes fast per-doc value lookup possible, which is nice. But it
> brings other interesting issues.
>
> Assume there are 100M docs and 200 NumericDocValuesFields; this ends up
> with a huge amount of disk and memory usage, even if there are just
> thousands of values for each field. I guess this is because Lucene stores
> a value for each DocValues field of each document, with a variable-length
> codec.
>
> So in such a scenario, is it possible to only store values for the
> DocValues fields of the documents that actually have a value for that
> field? Or does Lucene have a column storage mechanism, sort of like a
> hash map, for DocValues?

This really depends on the details of the Codec's encoding. So if it's important to you, the easiest way is to write a codec that compresses things the way you want (e.g. uses a two-stage table, block RLE, something like that). Maybe this would be a good contribution to add to the codecs/ module of Lucene so other people could use it too.

I think some of the codecs might do this for some types: maybe just as an accidental side effect of their current compression/encoding (e.g. use of BlockPackedWriter). But it's not something we really optimize for, as it doesn't make sense for a lot of use cases like scoring factors or faceting for docvalues.

For example, if you want to facet on tons of sparse fields, it's probably better to use Lucene's faceting module, which uses one combined docvalues field for the document... I think.
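To make the space tradeoff concrete, here is a standalone sketch (plain Java, not Lucene code; the class and method names are made up for illustration) of dense column storage, one slot per document, versus the hash-map-style sparse storage the question asks about, which only keeps entries for documents that actually have a value:

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

/**
 * Hypothetical sketch of the dense-vs-sparse layout tradeoff.
 * Dense: one slot per document in the segment, even for docs with no value.
 * Sparse: one entry per document that actually has a value.
 */
public class SparseColumnSketch {

    // Dense layout: allocate maxDoc slots, fill in the docs that have values.
    static long[] denseColumn(int maxDoc, Map<Integer, Long> values, long missing) {
        long[] col = new long[maxDoc];
        Arrays.fill(col, missing); // sentinel for docs without a value
        for (Map.Entry<Integer, Long> e : values.entrySet()) {
            col[e.getKey()] = e.getValue();
        }
        return col;
    }

    public static void main(String[] args) {
        int maxDoc = 1_000_000;
        // Only a few thousand docs carry a value for this field (one in 500).
        Map<Integer, Long> sparse = new HashMap<>();
        for (int doc = 0; doc < maxDoc; doc += 500) {
            sparse.put(doc, (long) doc * 2);
        }
        long[] dense = denseColumn(maxDoc, sparse, 0L);
        System.out.println("dense slots:  " + dense.length); // one per doc: 1000000
        System.out.println("sparse slots: " + sparse.size()); // one per value: 2000
    }
}
```

With 200 such fields the dense layout pays `200 * maxDoc` slots regardless of how sparse the data is, which is the cost the question describes; a codec with a sparse encoding would pay roughly in proportion to the number of actual values instead.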