On 08/09/2015 04:55 PM, Arjen van der Meijden wrote:
On 9-8-2015 16:22, Toke Eskildsen wrote:
Robert Muir <rcm...@gmail.com> wrote:
I am tired of repeating this:
Don't use BINARY docvalues
Don't use BINARY docvalues
Don't use BINARY docvalues
Use types like SORTED/SORTED_SET which will compress the term
dictionary and make use of ordinals in your application instead.
This seems contrary to
http://lucene.apache.org/core/5_2_0/core/org/apache/lucene/document/BinaryDocValuesField.html
Maybe you could update the JavaDoc for that field to warn against using it?
It (probably) depends on the contents of the values. If the number of
distinct values is roughly equal to the number of documents, the javadoc
suggests that binary docvalues are a valid choice.
My values are unique, so there are as many distinct values as documents.
They have varying sizes, say at least 10 bytes, and may be a lot bigger
(say 4 kbytes).
I don't share, index or sort them.
I don't do grouping/faceting either.
I only want to store, retrieve and traverse those values.
That's this part:
"The values are stored directly with no sharing, which is a good fit
when the fields don't share (many) values, such as a title field."
If there are (many) fewer distinct values than documents, Robert's reply
and the documentation suggest the same:
" If values may be shared and sorted it's better to use
SortedDocValuesField."
So as soon as compression of smallish values starts making sense due to
repetition amongst documents, it may be time to move away from the
BinaryDocValuesField towards another variant.
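To make the "sharing" point concrete, here is a minimal conceptual sketch in plain Java. It is not Lucene code and does not reflect Lucene's on-disk format; the class name OrdinalSketch and its fields are made up for illustration. It only shows the idea behind a sorted term dictionary plus per-document ordinals: each distinct value is kept once, and every document holds a small integer pointing at it.

```java
import java.util.Arrays;
import java.util.TreeSet;

/** Conceptual sketch only: hypothetical class, not part of Lucene. */
final class OrdinalSketch {
    final String[] dictionary; // distinct values, sorted
    final int[] ordinals;      // ordinals[docId] = index into dictionary

    OrdinalSketch(String[] perDocValues) {
        // Collect the distinct values in sorted order.
        TreeSet<String> distinct = new TreeSet<String>(Arrays.asList(perDocValues));
        dictionary = distinct.toArray(new String[0]);

        // Replace each per-document value by its position in the dictionary.
        ordinals = new int[perDocValues.length];
        for (int docId = 0; docId < perDocValues.length; docId++) {
            ordinals[docId] = Arrays.binarySearch(dictionary, perDocValues[docId]);
        }
    }

    /** Looks the value up again through the ordinal. */
    String value(int docId) {
        return dictionary[ordinals[docId]];
    }
}
```

When every document has its own unique value, the dictionary is as long as the document list and the ordinals only add overhead, which is the case the BinaryDocValuesField javadoc describes.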
If only parts of the values are repeated (for instance something like
e-mail addresses where many will end with 'gmail.com' and 'outlook.com')
it becomes more complicated.
At the moment, there are some repeated parts inside each value, but a lot
more repeated parts across docIds, like "Expression" and "Reading".
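One possible application-level approach for values that share substrings across documents, sketched under assumptions: compress each value on its own with java.util.zip, using a preset dictionary built from the frequent substrings, and put the compressed bytes in the binary field. This is not a Lucene feature; the class name SharedDictionaryCodec and the dictionary contents are hypothetical, and whether it pays off depends on the actual data.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.Charset;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

/** Hypothetical helper: per-value DEFLATE with a shared preset dictionary. */
final class SharedDictionaryCodec {
    private final byte[] dictionary;

    /** frequentSubstrings: text that is expected to recur across many values. */
    SharedDictionaryCodec(String frequentSubstrings) {
        this.dictionary = frequentSubstrings.getBytes(Charset.forName("UTF-8"));
    }

    byte[] compress(byte[] value) {
        Deflater deflater = new Deflater(Deflater.BEST_COMPRESSION);
        try {
            deflater.setDictionary(dictionary); // must be set before any input is deflated
            deflater.setInput(value);
            deflater.finish();
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buffer = new byte[256];
            while (!deflater.finished()) {
                int n = deflater.deflate(buffer);
                out.write(buffer, 0, n);
            }
            return out.toByteArray();
        } finally {
            deflater.end(); // release native memory
        }
    }

    byte[] decompress(byte[] compressed) {
        Inflater inflater = new Inflater();
        try {
            inflater.setInput(compressed);
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buffer = new byte[256];
            while (!inflater.finished()) {
                int n = inflater.inflate(buffer);
                if (n > 0) {
                    out.write(buffer, 0, n);
                } else if (inflater.needsDictionary()) {
                    // The stream header asks for the same dictionary used when compressing.
                    inflater.setDictionary(dictionary);
                } else if (!inflater.finished() && inflater.needsInput()) {
                    throw new IllegalArgumentException("truncated input");
                }
            }
            return out.toByteArray();
        } catch (DataFormatException e) {
            throw new IllegalArgumentException(e);
        } finally {
            inflater.end();
        }
    }
}
```

The dictionary has to be identical at write and read time, so it would need to be versioned alongside the index. Decompressing every value also adds CPU cost to a full traversal.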
Also, I'm stuck with Lucene 4.7.0 (or 4.7.2) because, starting with
version 4.8, Lucene uses try-with-resources, which isn't supported on
Android before Android 4.4.
SortedDocValuesField stores a per-document BytesRef
<http://lucene.apache.org/core/5_2_0/core/org/apache/lucene/util/BytesRef.html>
value, indexed for sorting.
If you also need to store the value, you should add a separate StoredField
<http://lucene.apache.org/core/5_2_0/core/org/apache/lucene/document/StoredField.html>
instance.
I actually went with binaryDocValues because I thought that
DocValues were way more efficient than the pre-4.0 fields for storing stuff
(like only using 1 seek/read ...with mmap...), especially with traversal.
In my app, I traverse all binaryDocValues in increasing docId order,
deserialize my docValues (lightning fast with FlatBuffers, no object
creation -> complex objects) and do some stats....
Would I be able to do that as efficiently with a StoredField?
Apparently, only stored fields are compressed
(CompressingStoredFieldsFormat).
Maybe I should use that (and ditch the useless docValue or make it
store a BytesRef) to get compression?
Many thanks for all the insights, :)
Olivier
Best regards,
Arjen