[
https://issues.apache.org/jira/browse/SOLR-11741?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16315465#comment-16315465
]
Abhishek Kumar Singh edited comment on SOLR-11741 at 1/7/18 9:02 PM:
---------------------------------------------------------------------
The above approach can be optimised by replacing the *Supported FieldTypes* with
*_BitSets_*, as shown in the following table:
!screenshot-1.png!
We can map every FieldType to a BitSet, e.g. *String will be 10000*, *Long
will be 00100*, and so on.
1. For every product, get the BitSet of the FieldType supported by each
field.
2. For every field, compute the *_BITWISE OR_* of the current BitSet with the
BitSet value already recorded, and store the result in its place.
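The two steps above can be sketched roughly as follows. This is only an illustration in Python, using plain integers as bit masks in place of a BitSet. The three bit codes come from this comment; the helper names `bits_for_value` and `accumulate` are hypothetical, and the sketch assumes that every value arrives as a string and maps to exactly one bit.

```python
# Bit codes quoted in this comment; the remaining bit positions are left unassigned.
STRING = 0b10000
DOUBLE = 0b01000
LONG = 0b00100


def bits_for_value(value: str) -> int:
    """Return the bit code of the narrowest type that can parse `value`.

    Hypothetical helper: a real implementation would use Solr's own
    type-guessing logic rather than Python's int()/float() parsers.
    """
    try:
        int(value)
        return LONG
    except ValueError:
        pass
    try:
        float(value)
        return DOUBLE
    except ValueError:
        return STRING


def accumulate(products):
    """Step 1 and 2: OR each value's bit code into the mask recorded per field."""
    recorded = {}
    for product in products:
        for field, value in product.items():
            recorded[field] = recorded.get(field, 0) | bits_for_value(value)
    return recorded
```

Because OR is commutative and idempotent, the recorded mask does not depend on the order in which products are seen.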
Use the following rule to decide the final FieldType that the field should
have.
!RuleForMostAccomodatingField.png!
Say a field called *price* has the following values:
In Product1 -> *12321 (Long, i.e. 00100)*
In Product2 -> *77261.66 (Double, i.e. 01000)*
The supported BitSet for *price* will have a final value of *[ 00100 OR 01000 =
01100 ]*, i.e. it should be assigned a Double.
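The *price* example can be checked with a few lines of Python. Note that the actual decision rule lives in the attached image and is not reproduced in the text, so the rule used here ("the highest set bit wins") is an assumption that merely agrees with the two data points given in this comment (01100 resolves to Double, and String has the highest bit). The function name `resolve` is hypothetical.

```python
LONG, DOUBLE, STRING = 0b00100, 0b01000, 0b10000
NAMES = {LONG: "Long", DOUBLE: "Double", STRING: "String"}


def resolve(mask: int) -> str:
    """Pick the final type for a field from its accumulated mask.

    ASSUMPTION: the most significant set bit decides the type. The real
    rule is defined in the attached image, not in this text.
    """
    if mask == 0:
        raise ValueError("no values recorded for this field")
    highest = 1 << (mask.bit_length() - 1)
    if highest not in NAMES:
        raise ValueError("bit code not covered by this sketch")
    return NAMES[highest]


price = LONG | DOUBLE          # Product1 OR Product2
print(format(price, "05b"))    # 01100
print(resolve(price))          # Double
```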
The above rule can be extended to any number of types; only the number of bits
will increase accordingly.
Using BitSets as above will reduce the storage to 1 byte per field (as long as
there are at most 8 types), will make the computation simpler and faster, and
will also remove the overhead of computing the trained schema separately, as
the BitSets will be updated in place with every Product.
Every API call asking for the *Trained Schema* will get the schema calculated
up to that point using the above rule.
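To illustrate the in-place update and the "schema so far" lookup together, here is a minimal sketch in Python. The class `SchemaTrainer` and its methods are hypothetical names and not part of any Solr API; values are assumed to be strings, and the "highest set bit wins" rule is again an assumption standing in for the rule in the attached image.

```python
LONG, DOUBLE, STRING = 0b00100, 0b01000, 0b10000
NAMES = {LONG: "Long", DOUBLE: "Double", STRING: "String"}


def _code(value: str) -> int:
    """Bit code of the narrowest type that parses `value` (illustrative only)."""
    for parse, code in ((int, LONG), (float, DOUBLE)):
        try:
            parse(value)
            return code
        except ValueError:
            continue
    return STRING


class SchemaTrainer:
    """Keeps one small bit mask per field and updates it with every product."""

    def __init__(self):
        self._masks = {}  # field name -> accumulated bit mask

    def add(self, product):
        """OR each value's bit code into the mask already recorded."""
        for field, value in product.items():
            self._masks[field] = self._masks.get(field, 0) | _code(value)

    def trained_schema(self):
        """Schema as of now; ASSUMES the highest set bit decides the type."""
        return {
            field: NAMES[1 << (mask.bit_length() - 1)]
            for field, mask in self._masks.items()
        }
```

Calling `trained_schema()` after each `add()` needs no separate training pass: after a product with price "12321" the field resolves to Long, and after a later product with price "77261.66" it resolves to Double.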
> Offline training mode for schema guessing
> -----------------------------------------
>
> Key: SOLR-11741
> URL: https://issues.apache.org/jira/browse/SOLR-11741
> Project: Solr
> Issue Type: Improvement
> Security Level: Public(Default Security Level. Issues are Public)
> Reporter: Ishan Chattopadhyaya
> Attachments: RuleForMostAccomodatingField.png, SOLR-11741-temp.patch,
> screenshot-1.png, screenshot-3.png
>
>
> Our data-driven schema guessing doesn't work in many situations. For
> example, if the first document has a field with value "0", it is guessed as
> Long and subsequent fields with "0.0" are rejected. Similarly, if the same
> field has alphanumeric contents in a later document, those documents are
> rejected. Also, single vs. multi-valued field guessing is not ideal.
> Proposing an offline training mode where Solr accepts a bunch of documents and
> returns a guessed schema (without indexing). This schema can then be used for
> actual indexing. I think the original idea is from Hoss.
> I think the initial implementation can be based on an UpdateRequestProcessor. We
> can hash out the API soon, as we go along.