[ 
https://issues.apache.org/jira/browse/SOLR-10255?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Smiley updated SOLR-10255:
--------------------------------
    Attachment: SOLR-10255.patch

Here's a patch that's in-progress with a bunch of nocommits/discussion points.  
It theoretically works but *there are no tests yet* so I doubt it :-).
* I added a "large" flag to FieldType but in hindsight perhaps this belongs on 
TextField because I'm only adding it there?  BTW a ramification of this is that 
you wouldn't be able to set it on the field definition, only the fieldType.  I 
could see this being useful on BinaryField but I don't intend to work on that.
* The BinaryDocValuesField is given a name distinct from the base field name: 
the base name with a {{___large_}} prefix.  I didn't have to do this but I want to allow for 
TextField to some day have conventional SortedSetDocValues on 
analyzed/tokenized text.  In Lucene we can't have both types of DocValues for 
the same field name.
* I sorta cheat and we pretend the field is still "stored" but in reality it's 
not... at least it's not "stored" in the Lucene sense.  This is deliberate 
because I want this field to be compatible with various other Solr features 
that don't know anything about this new "large" concept.
* One unfortunate thing here is that the document-loading code in 
SolrIndexSearcher now has to call {{DocValues.getBinary(getSlowAtomicReader(), 
TextField.LARGE_NAME_PREFIX + largeField)}} and then call 
{{advanceExact(docId)}} for each field in the schema that's marked as large.  
This is done so that we know if the field even has a large value for this 
document.  It's almost always necessary to do this if there are any declared 
large fields.  This may not be a big deal in the scheme of things?  One 
possible solution is for {{TextField.createFields()}} to add a special stored 
field named perhaps {{___largeFields}} and supply the field name as a value.
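
To make the cost described in the last bullet concrete, here is a minimal sketch in plain Java. It is an in-memory model only: the maps stand in for the BinaryDocValues columns and for stored fields, and the class and method names (LargeFieldLookupSketch, loadProbingAll, loadUsingMarker, columnProbes) are hypothetical and are not taken from the patch or from Lucene/Solr. The first strategy probes every declared large field for each document; the second assumes a {{___largeFields}} marker was written at index time and probes only the fields it names.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.Collections;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Simplified in-memory model; NOT the Lucene/Solr API and NOT the patch's code. */
class LargeFieldLookupSketch {
    static final String LARGE_NAME_PREFIX = "___large_";        // prefix named in the message
    static final String LARGE_FIELDS_MARKER = "___largeFields"; // marker field floated as an idea

    // Stand-in for per-field binary doc values: column name -> (docId -> bytes).
    final Map<String, Map<Integer, byte[]>> columns = new HashMap<>();
    // Stand-in for stored fields: docId -> (field name -> values).
    final Map<Integer, Map<String, List<String>>> stored = new HashMap<>();
    // Counts how many per-field column lookups were performed.
    int columnProbes = 0;

    void index(int docId, String field, String value, boolean large) {
        Map<String, List<String>> doc = stored.computeIfAbsent(docId, k -> new LinkedHashMap<>());
        if (large) {
            // The value goes to the prefixed column; only the field name is "stored".
            columns.computeIfAbsent(LARGE_NAME_PREFIX + field, k -> new HashMap<>())
                   .put(docId, value.getBytes(StandardCharsets.UTF_8));
            doc.computeIfAbsent(LARGE_FIELDS_MARKER, k -> new ArrayList<>()).add(field);
        } else {
            doc.computeIfAbsent(field, k -> new ArrayList<>()).add(value);
        }
    }

    /** Strategy A: probe the column of every large field declared in the schema. */
    Map<String, String> loadProbingAll(int docId, Collection<String> declaredLargeFields) {
        Map<String, String> result = new LinkedHashMap<>();
        for (String field : declaredLargeFields) {
            String value = probe(docId, field);
            if (value != null) {
                result.put(field, value);
            }
        }
        return result;
    }

    /** Strategy B: read the marker first, then probe only the fields it lists. */
    Map<String, String> loadUsingMarker(int docId) {
        Map<String, List<String>> doc = stored.getOrDefault(docId, Collections.emptyMap());
        List<String> marked = doc.getOrDefault(LARGE_FIELDS_MARKER, Collections.emptyList());
        Map<String, String> result = new LinkedHashMap<>();
        for (String field : marked) {
            String value = probe(docId, field);
            if (value != null) {
                result.put(field, value);
            }
        }
        return result;
    }

    private String probe(int docId, String field) {
        columnProbes++;
        Map<Integer, byte[]> column = columns.get(LARGE_NAME_PREFIX + field);
        byte[] bytes = (column == null) ? null : column.get(docId);
        return (bytes == null) ? null : new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        LargeFieldLookupSketch sketch = new LargeFieldLookupSketch();
        sketch.index(0, "body", "a very long text", true);
        sketch.index(1, "title", "short", false);
        List<String> declared = Arrays.asList("body", "transcript");

        System.out.println(sketch.loadProbingAll(1, declared) + " probes=" + sketch.columnProbes);
        System.out.println(sketch.loadUsingMarker(1) + " probes=" + sketch.columnProbes);
        System.out.println(sketch.loadUsingMarker(0) + " probes=" + sketch.columnProbes);
    }
}
```

In this model, a document with no large values still costs one probe per declared large field under the first strategy, while the marker-based strategy costs none for that document. Whether that difference matters in a real index is exactly the open question raised above.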

In a separate issue I'll propose a compressed DocValuesFormat that Solr's 
SchemaCodecFactory will supply for fields starting with "___large_". Or I 
might make it an auto-registered internal field type in the schema; we'll see.

BTW this approach is incompatible with multiValued fields since BinaryDocValues 
has this limitation.

_I'd really appreciate peer review, even if it's just a cursory look at the 
patch_

> Large pseudo-stored fields via BinaryDocValuesField
> ---------------------------------------------------
>
>                 Key: SOLR-10255
>                 URL: https://issues.apache.org/jira/browse/SOLR-10255
>             Project: Solr
>          Issue Type: Improvement
>      Security Level: Public(Default Security Level. Issues are Public) 
>            Reporter: David Smiley
>            Assignee: David Smiley
>         Attachments: SOLR-10255.patch
>
>
> (sub-issue of SOLR-10117)  This is a proposal for a better way for Solr to 
> handle "large" text fields.  Large docs that are in Lucene StoredFields slow 
> requests that don't involve access to such fields.  This follows from 
> the fact that StoredFields are row-stored.  Worse, the Solr documentCache 
> will wind up holding onto massive Strings.  While the latter could be tackled 
> on its own somehow as it's the most serious issue, it still seems 
> wrong that such large fields are in row-stored storage to begin with.  After 
> all, relational DBs seem to have figured this out and put CLOBs/BLOBs in a 
> separate place.  Here, we do similarly by using Lucene 
> {{BinaryDocValuesField}}.  BDVF isn't well known in the DocValues family as 
> it's not for typical DocValues purposes like sorting/faceting etc.  The 
> default DocValuesFormat doesn't compress these but we could write one that 
> does.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
