[
https://issues.apache.org/jira/browse/LUCENE-8689?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Ivan Mamontov updated LUCENE-8689:
----------------------------------
Description:
To avoid issues where some products become available/unavailable at some point
in time after being out-of-stock, e-commerce search system designers need to
embed up-to-date information about inventory availability right into the search
engines. Key requirement is to be able to accurately filter out unavailable
products and use availability as one of ranking signals. However, keeping
availability data up-to-date is a non-trivial task. Straightforward
implementation based on a partial updates of Lucene documents causes Solr cache
trashing with negatively affected query performance and resource utilization.
As an alternative solution we can use DocValues and build-in in-place updates
where field values can be independently updated without touching inverted
index, and while filtering by DocValues is a bit slower, overall performance
gain is better. However existing long based docValues are not sufficiently
optimized for carrying boolean inventory availability data:
* All DocValues queries are internally rewritten into
org.apache.lucene.search.DocValuesNumbersQuery which is based on direct
iteration over all column values and typically much slower than using
TermsQuery.
* On every commit/merge codec has to iterate over DocValues a couple times in
order to choose ths best compression algorithm suitable for given data. As a
result for 4K fields and 3M max doc merge takes more than 10 minutes
This issue is intended to solve these limitations via special bitwise doc
values format that uses internal representation of
org.apache.lucene.util.FixedBitSet in order to store indexed values and load
them at search time as a simple long array without additional decoding. There
are several reasons for this:
* At index time encoding is super fast without superfluous iterations over all
values to choose ths best compression algorithm suitable for given data.</li>
* At query time decoding is also simple and fast, no GC pressure and extra
steps
* Internal representation allows to perform random access in constant time
Limitations are:
* Does not support non boolean fields
* Boolean fields must be represented as long values 1 for true and 0 for false
* Current implementation does not support advanced bit set formats like
org.apache.lucene.util.SparseFixedBitSet or
org.apache.lucene.util.RoaringDocIdSet
In order to evaluate performance gain I've wrote a simple JMH based benchmark
[^SynteticDocValuesBench70.java] which allows to estimate a relative cost of
DF filters. This benchmark creates 2 000 000 documents with 5 boolean columns
with different density, where 10, 35, 50, 60 and 90 is an amount of documents
with value 1. Each method tries to enumerate over all values in synthetic store
field in all available ways:
* baseline – in almost all cases Solr uses FixedBitSet in filter cache to keep
store availability. This test just iterates over all bits.
* docValuesRaw – iterates over all values of DV column, the same code is used
in "post filtering", sorting and faceting.
* docValuesNumbersQuery – iterates over all values produced by query/filter
store:1, actually there is the only query implementation for DV based fields -
DocValuesNumbersQuery. This means that Lucene rewrites all term, range and
filter queries for non indexed filed into this fallback implementation.
* docValuesBooleanQuery – optimized variant of DocValuesNumbersQuery, which
support only two values – 0/1
!results2.png!
was:
To avoid issues where some products become available/unavailable at some point
in time after being out-of-stock, e-commerce search system designers need to
embed up-to-date information about inventory availability right into the search
engines. Key requirement is to be able to accurately filter out unavailable
products and use availability as one of ranking signals. However, keeping
availability data up-to-date is a non-trivial task. Straightforward
implementation based on a partial updates of Lucene documents causes Solr cache
trashing with negatively affected query performance and resource utilization.
As an alternative solution we can use DocValues and build-in in-place updates
where field values can be independently updated without touching inverted
index, and while filtering by DocValues is a bit slower, overall performance
gain is better. However existing long based docValues are not sufficiently
optimized for carrying boolean inventory availability data:
* All DocValues queries are internally rewritten into
org.apache.lucene.search.DocValuesNumbersQuery which is based on direct
iteration over all column values and typically much slower than using
TermsQuery.
* On every commit/merge codec has to iterate over DocValues a couple times in
order to choose ths best compression algorithm suitable for given data. As a
result for 4K fields and 3M max doc merge takes more than 10 minutes
This issue is intended to solve these limitations via special bitwise doc
values format that uses internal representation of
org.apache.lucene.util.FixedBitSet in order to store indexed values and load
them at search time as a simple long array without additional decoding. There
are several reasons for this:
* At index time encoding is super fast without superfluous iterations over all
values to choose ths best compression algorithm suitable for given data.</li>
* At query time decoding is also simple and fast, no GC pressure and extra
steps
* Internal representation allows to perform random access in constant time
Limitations are:
* Does not support non boolean fields
* Boolean fields must be represented as long values 1 for true and 0 for false
* Current implementation does not support advanced bit set formats like
org.apache.lucene.util.SparseFixedBitSet or
org.apache.lucene.util.RoaringDocIdSet
In order to evaluate performance gain I've wrote a simple JMH based benchmark
[^SynteticDocValuesBench70.java] which allows to estimate a relative cost of
DF filters. This benchmark creates 2 000 000 documents with 5 boolean columns
with different density, where 10, 35, 50, 60 and 90 is an amount of documents
with value 1. Each method tries to enumerate over all values in synthetic store
field in all available ways:
* baseline – in almost all cases Solr uses FixedBitSet in filter cache to keep
store availability. This test just iterates over all bits.
* docValuesRaw – iterates over all values of DV column, the same code is used
in "post filtering", sorting and faceting.
* docValuesNumbersQuery – iterates over all values produced by query/filter
store:1, actually there is the only query implementation for DV based fields -
DocValuesNumbersQuery. This means that Lucene rewrites all term, range and
filter queries for non indexed filed into this fallback implementation.
* docValuesBooleanQuery – optimized variant of DocValuesNumbersQuery, which
support only two values – 0/1
!results2.png|thumbnail!
> Boolean DocValues Codec Implementation
> --------------------------------------
>
> Key: LUCENE-8689
> URL: https://issues.apache.org/jira/browse/LUCENE-8689
> Project: Lucene - Core
> Issue Type: Improvement
> Components: core/codecs
> Reporter: Ivan Mamontov
> Priority: Minor
> Labels: patch, performance
> Attachments: LUCENE-8689.patch, SynteticDocValuesBench70.java,
> results.png, results2.png
>
>
> To avoid issues where some products become available/unavailable at some
> point in time after being out-of-stock, e-commerce search system designers
> need to embed up-to-date information about inventory availability right into
> the search engines. Key requirement is to be able to accurately filter out
> unavailable products and use availability as one of ranking signals. However,
> keeping availability data up-to-date is a non-trivial task. Straightforward
> implementation based on a partial updates of Lucene documents causes Solr
> cache trashing with negatively affected query performance and resource
> utilization.
> As an alternative solution we can use DocValues and build-in in-place
> updates where field values can be independently updated without touching
> inverted index, and while filtering by DocValues is a bit slower, overall
> performance gain is better. However existing long based docValues are not
> sufficiently optimized for carrying boolean inventory availability data:
> * All DocValues queries are internally rewritten into
> org.apache.lucene.search.DocValuesNumbersQuery which is based on direct
> iteration over all column values and typically much slower than using
> TermsQuery.
> * On every commit/merge codec has to iterate over DocValues a couple times
> in order to choose ths best compression algorithm suitable for given data. As
> a result for 4K fields and 3M max doc merge takes more than 10 minutes
> This issue is intended to solve these limitations via special bitwise doc
> values format that uses internal representation of
> org.apache.lucene.util.FixedBitSet in order to store indexed values and load
> them at search time as a simple long array without additional decoding. There
> are several reasons for this:
> * At index time encoding is super fast without superfluous iterations over
> all values to choose ths best compression algorithm suitable for given
> data.</li>
> * At query time decoding is also simple and fast, no GC pressure and extra
> steps
> * Internal representation allows to perform random access in constant time
> Limitations are:
> * Does not support non boolean fields
> * Boolean fields must be represented as long values 1 for true and 0 for
> false
> * Current implementation does not support advanced bit set formats like
> org.apache.lucene.util.SparseFixedBitSet or
> org.apache.lucene.util.RoaringDocIdSet
> In order to evaluate performance gain I've wrote a simple JMH based benchmark
> [^SynteticDocValuesBench70.java] which allows to estimate a relative cost
> of DF filters. This benchmark creates 2 000 000 documents with 5 boolean
> columns with different density, where 10, 35, 50, 60 and 90 is an amount of
> documents with value 1. Each method tries to enumerate over all values in
> synthetic store field in all available ways:
> * baseline – in almost all cases Solr uses FixedBitSet in filter cache to
> keep store availability. This test just iterates over all bits.
> * docValuesRaw – iterates over all values of DV column, the same code is
> used in "post filtering", sorting and faceting.
> * docValuesNumbersQuery – iterates over all values produced by query/filter
> store:1, actually there is the only query implementation for DV based fields
> - DocValuesNumbersQuery. This means that Lucene rewrites all term, range and
> filter queries for non indexed filed into this fallback implementation.
> * docValuesBooleanQuery – optimized variant of DocValuesNumbersQuery, which
> support only two values – 0/1
> !results2.png!
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]