[
https://issues.apache.org/jira/browse/LUCENE-4583?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13658088#comment-13658088
]
David Smiley commented on LUCENE-4583:
--------------------------------------
I can understand that an all-in-RAM codec is sensitive to size. In that
light, I can also understand that 32KB per document is a lot. But the _average_
per-document variable-length byte size for Barakat's index is a measly 10
bytes. The maximum is around 69k. Likewise, for the user Shai referenced on
the list who was using it for faceting, it's only the worst-case document(s)
that exceeded 32KB.
Might the 16 in "new PagedBytes(16)" in Lucene42DocValuesProducer.loadBinary()
be made configurable? And/or perhaps make loadBinary() protected, so that
another codec extending this one can keep its changes somewhat minimal.
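
A rough sketch of the two options above, for illustration only. The classes
ProducerSketch and LargeValueProducerSketch and the method binaryBlockBits()
are hypothetical stand-ins, not the real Lucene42DocValuesProducer or
PagedBytes, and the byte[][] allocation only mimics a paged buffer:

```java
// Hypothetical stand-ins for illustration; these are not the real Lucene classes.
class ProducerSketch {
    /** Block-size exponent for the binary buffer; 16 mirrors the hard-coded value. */
    protected int binaryBlockBits() {
        return 16;
    }

    /** Protected, so a subclass could also replace the loading strategy outright. */
    protected byte[][] loadBinary(int totalBytes) {
        int blockSize = 1 << binaryBlockBits();
        // Round up so a partially filled last block is still allocated.
        int numBlocks = (totalBytes + blockSize - 1) / blockSize;
        return new byte[numBlocks][blockSize];
    }
}

/** A codec extension that only changes the block size. */
class LargeValueProducerSketch extends ProducerSketch {
    @Override
    protected int binaryBlockBits() {
        return 20;
    }
}
```

With a hook like this, the extending codec overrides one small method rather
than copying the whole producer.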
Mike, one improvement that could be made to your latest patch: instead of
Lucene42DocValuesConsumer assuming the limit is "ByteBlockPool.BYTE_BLOCK_SIZE
- 2" (which it technically is, _but only by coincidence_), it could
reference a calculated constant shared with the code that actually imposes
this limit, which is Lucene42DocValuesProducer.loadBinary(). For example, set
the constant to 2^16-2, but then add an assert in loadBinary() that the
constant is consistent with the PagedBytes instance's config. Or something
like that.
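
As a minimal sketch of that idea, with hypothetical names throughout:
SharedLimitSketch, BLOCK_BITS, MAX_BINARY_LENGTH, and checkLength() are
assumptions, not the actual Lucene source. The relation between the block
bits and the limit is taken from the 2^16-2 figure above and is likewise an
assumption. An explicit check stands in for the assert so that it also fires
when assertions are disabled:

```java
// Hypothetical stand-in for the suggestion above; not the actual Lucene source.
final class SharedLimitSketch {
    /** Block-size exponent handed to the paged buffer (the "16" discussed above). */
    static final int BLOCK_BITS = 16;

    /** Single source of truth for the largest binary value: 2^16 - 2. */
    static final int MAX_BINARY_LENGTH = (1 << BLOCK_BITS) - 2;

    /** Consumer (write) side: reject oversized values using the shared constant. */
    static void checkLength(int length) {
        if (length > MAX_BINARY_LENGTH) {
            throw new IllegalArgumentException(
                "binary value is too large: " + length + " > " + MAX_BINARY_LENGTH);
        }
    }

    /** Producer (read) side: verify the buffer config still matches the constant. */
    static void loadBinary(int blockBits) {
        if (MAX_BINARY_LENGTH != (1 << blockBits) - 2) {
            throw new AssertionError(
                "MAX_BINARY_LENGTH is out of sync with blockBits=" + blockBits);
        }
        // ... the real method would go on to read the values into the buffer ...
    }
}
```

If someone later changes the block size without updating the limit, the
reader fails immediately instead of the two sides drifting apart silently.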
bq. David can you open a separate issue about changing the limit for existing
codecs?
Uh... all the discussion has been here, so it seems too late for that to me.
And I'm probably done making my arguments. I can't be more convincing than
pointing out the 10-byte average figure for my use case.
> StraightBytesDocValuesField fails if bytes > 32k
> ------------------------------------------------
>
> Key: LUCENE-4583
> URL: https://issues.apache.org/jira/browse/LUCENE-4583
> Project: Lucene - Core
> Issue Type: Bug
> Components: core/index
> Affects Versions: 4.0, 4.1, 5.0
> Reporter: David Smiley
> Priority: Critical
> Fix For: 4.4
>
> Attachments: LUCENE-4583.patch, LUCENE-4583.patch, LUCENE-4583.patch,
> LUCENE-4583.patch, LUCENE-4583.patch
>
>
> I didn't observe any limitations on the size of a bytes based DocValues field
> value in the docs. It appears that the limit is 32k, although I didn't get
> any friendly error telling me that was the limit. 32k is kind of small IMO;
> I suspect this limit is unintended and as such is a bug. The following
> test fails:
> {code:java}
> public void testBigDocValue() throws IOException {
>   Directory dir = newDirectory();
>   IndexWriter writer = new IndexWriter(dir, writerConfig(false));
>   Document doc = new Document();
>   BytesRef bytes = new BytesRef((4+4)*4097); // 4096 works
>   bytes.length = bytes.bytes.length; // byte data doesn't matter
>   doc.add(new StraightBytesDocValuesField("dvField", bytes));
>   writer.addDocument(doc);
>   writer.commit();
>   writer.close();
>   DirectoryReader reader = DirectoryReader.open(dir);
>   DocValues docValues = MultiDocValues.getDocValues(reader, "dvField");
>   // FAILS IF BYTES IS BIG!
>   docValues.getSource().getBytes(0, bytes);
>   reader.close();
>   dir.close();
> }
> {code}