[jira] [Commented] (SOLR-5354) Distributed sort is broken with CUSTOM FieldType

Hoss Man (JIRA) Fri, 22 Nov 2013 11:09:09 -0800

    [ 
https://issues.apache.org/jira/browse/SOLR-5354?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13830238#comment-13830238
 ]


Hoss Man commented on SOLR-5354:
--------------------------------

I've been looking into Solr's distributed sorting code more and more as part of 
my investigation into SOLR-5463, and I spoke briefly with sarowe offline about 
the overlap.

I think the problems with CUSTOM distributed sorting are really just a subset of 
the larger weirdness with the assumptions Solr makes in general about how it 
can do distributed sorting and how it can de/serialize the sort values when 
merging the results from the multiple shards.

I think my earlier suggestion (in the email that Jessica quoted in the issue 
summary) about using methods on the FieldType (like indexedToReadable and 
toObject) to ensure we safely de/serialize the sort values is still the right 
way to go -- we have to ensure that no matter what strange or arbitrary objects 
are used by a FieldComparator, we can safely serialize them.  But I'm no longer 
convinced that re-using those existing methods makes sense -- because the sort 
values used by a FieldType's FieldComparator may not map directly to the 
"end user" representation of the value (i.e. TrieDateField sorts as "long", but 
toObject returns "Date"; String fields sort on BytesRefs; custom classes sort 
on who-knows-what, etc.)

I think the best solution would be something like:


* move the toExternal/toInternal concept in the existing patch out of 
FieldComparatorSource and into Solr's FieldType as methods clearly meant to be 
very specific to sorting (i.e. "marshalSortValue" and "unmarshalSortValue")
* change the fsv=true logic on shards to use marshalSortValue for any SortField 
that is on a field (if it's score or a function it will be a simple numeric and 
already safe to serialize over the wire)
* change the mergeIds logic on the coordinator node to explicitly use 
unmarshalSortValue and then use the _actual_ FieldComparator associated with 
each SortField instead of the hokey assumptions currently being made in 
ShardFieldSortedHitQueue.getCachedComparator about using things like 
"comparatorNatural"

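A minimal sketch of the three steps above, using plain JDK types only. The 
SortValueCodec interface, its method signatures, and the Base64 wire format are 
assumptions made for illustration; they stand in for the proposed FieldType 
hooks and are not Solr's actual API. The example field sorts as unsigned bytes, 
like the one in the issue description.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Base64;
import java.util.Comparator;
import java.util.List;

/** Hypothetical stand-in for the sort-specific hooks proposed for FieldType. */
interface SortValueCodec<T> {
    /** Shard side: turn the comparator's sort value into something wire-safe. */
    Object marshalSortValue(T value);

    /** Coordinator side: restore the sort value from its wire form. */
    T unmarshalSortValue(Object wire);

    /** Stand-in for the ordering of the field's real FieldComparator. */
    Comparator<T> sortComparator();
}

/** Example codec for a binary field that sorts as unsigned bytes. */
final class BinarySortValueCodec implements SortValueCodec<byte[]> {
    @Override
    public Object marshalSortValue(byte[] value) {
        // Base64 is an assumed wire format; any lossless encoding would do.
        return value == null ? null : Base64.getEncoder().encodeToString(value);
    }

    @Override
    public byte[] unmarshalSortValue(Object wire) {
        return wire == null ? null : Base64.getDecoder().decode((String) wire);
    }

    @Override
    public Comparator<byte[]> sortComparator() {
        // Unsigned lexicographic comparison (requires Java 9+).
        return (a, b) -> Arrays.compareUnsigned(a, b);
    }
}

final class MergeSketch {
    /** Coordinator side: unmarshal every shard's values, then sort with the field's own ordering. */
    static <T> List<T> merge(SortValueCodec<T> codec, List<List<Object>> shardValues) {
        List<T> merged = new ArrayList<>();
        for (List<Object> shard : shardValues) {
            for (Object wire : shard) {
                merged.add(codec.unmarshalSortValue(wire));
            }
        }
        merged.sort(codec.sortComparator());
        return merged;
    }
}
```

The point of the sketch is that the merge step never guesses at a comparator 
from the runtime type of the deserialized value; both the wire format and the 
ordering come from the same per-field object.
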
----

Other misc comments...


bq. If the deserialization method depends on FieldType, the node responsible 
for the merge must also have the schema loaded, which might not be the case in 
SolrCloud.

That's already a requirement in SolrCloud - the coordinator node merging results 
and writing them back to the client already has to have the same schema.  (If 
it didn't, a custom FieldType with a custom FieldComparator could never work, 
because there would be no way at all to know what order things should go in.)

bq. I think solr should fix its own apis here? It could add FieldType[] to 
SortSpec or something like that.

I'm not sure why that would help?  We can already ask each SortField for its 
getField() and then look that up in the Schema.  The crux of the problem really 
seems to be: naive assumptions in the distributed sorting code about how to 
safely send sort values over the wire, and what comparator to use when sorting 
those values.
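
To make the lookup concrete, here is a rough sketch of the coordinator 
resolving each sort field by name. SketchSortField and SketchSchema are 
hypothetical, simplified stand-ins rather than Lucene's SortField or Solr's 
IndexSchema, and treating a null field name as "score or function sort" is an 
assumption of this sketch.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

/** Hypothetical stand-in for a sort clause; not Lucene's SortField. */
final class SketchSortField {
    private final String field; // null means a score or function sort in this sketch

    SketchSortField(String field) {
        this.field = field;
    }

    String getField() {
        return field;
    }
}

/** Hypothetical stand-in for the schema loaded on the coordinator node. */
final class SketchSchema {
    // field name -> function restoring a wire value to the field's sort value
    private final Map<String, Function<Object, Object>> unmarshallers = new HashMap<>();

    void register(String field, Function<Object, Object> unmarshaller) {
        unmarshallers.put(field, unmarshaller);
    }

    Object unmarshal(SketchSortField sortField, Object wire) {
        String name = sortField.getField();
        if (name == null) {
            return wire; // simple numeric, already safe as sent
        }
        Function<Object, Object> unmarshaller = unmarshallers.get(name);
        if (unmarshaller == null) {
            // The coordinator must share the shards' schema to merge at all.
            throw new IllegalStateException("no schema entry for sort field: " + name);
        }
        return unmarshaller.apply(wire);
    }
}
```

Nothing beyond the field name has to travel with the sort spec, as long as the 
coordinator has the same schema as the shards.
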


> Distributed sort is broken with CUSTOM FieldType
> ------------------------------------------------
>
>                 Key: SOLR-5354
>                 URL: https://issues.apache.org/jira/browse/SOLR-5354
>             Project: Solr
>          Issue Type: Bug
>          Components: SearchComponents - other
>    Affects Versions: 4.4, 4.5, 5.0
>            Reporter: Jessica Cheng
>            Assignee: Steve Rowe
>              Labels: custom, query, sort
>         Attachments: SOLR-5354.patch
>
>
> We added a custom field type to allow an indexed binary field type that 
> supports search (exact match), prefix search, and sort as unsigned bytes 
> lexicographical compare. For sort, BytesRef's UTF8SortedAsUnicodeComparator 
> accomplishes what we want, and even though the name of the comparator 
> mentions UTF8, it doesn't actually assume so and just does byte-level 
> operation, so it's good. However, when we do this across different nodes, we 
> run into an issue where in QueryComponent.doFieldSortValues:
>           // Must do the same conversion when sorting by a
>           // String field in Lucene, which returns the terms
>           // data as BytesRef:
>           if (val instanceof BytesRef) {
>             UnicodeUtil.UTF8toUTF16((BytesRef)val, spare);
>             field.setStringValue(spare.toString());
>             val = ft.toObject(field);
>           }
> UnicodeUtil.UTF8toUTF16 is called on our byte array, which isn't actually 
> UTF8. I did a hack where I specified our own field comparator to be 
> ByteBuffer based to get around that instanceof check, but then the field 
> value gets transformed into BYTEARR in JavaBinCodec, and when it's 
> unmarshalled, it gets turned into byte[]. Then, in QueryComponent.mergeIds, a 
> ShardFieldSortedHitQueue is constructed with ShardDoc.getCachedComparator, 
> which decides to give me comparatorNatural in the else of the TODO for 
> CUSTOM, which barfs because byte[] are not Comparable...
> From Chris Hostetter:
> I'm not very familiar with the distributed sorting code, but based on your
> comments, and a quick skim of the functions you pointed to, it definitely
> seems like there are two problems here for people trying to implement
> custom sorting in custom FieldTypes...
> 1) QueryComponent.doFieldSortValues - this definitely seems like it should
> be based on the FieldType, not an "instanceof BytesRef" check (oddly: the
> comment even suggests that it should be using the FieldType's
> indexedToReadable() method -- but it doesn't do that).  If it did, then
> this part of the logic should work for you as long as your custom
> FieldType implemented indexedToReadable in a sane way.
> 2) QueryComponent.mergeIds - that TODO definitely looks like a gap that
> needs filled.  I'm guessing the sanest thing to do in the CUSTOM case
> would be to ask the FieldComparatorSource (which should be coming from the
> SortField that the custom FieldType produced) to create a FieldComparator
> (via newComparator - the numHits & sortPos could be anything) and then
> wrap that up in a Comparator facade that delegates to
> FieldComparator.compareValues
> That way a custom FieldType could be in complete control of the sort
> comparisons (even when merging ids).
> ...But as I said: I may be missing something, I'm not super familiar with
> that code.  Please try it out and let us know if that works -- either way
> please open a Jira pointing out the problems trying to implement
> distributed sorting in a custom FieldType.
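
The "Comparator facade" idea from the quoted email could look roughly like the 
sketch below. SketchFieldComparator is a hypothetical stand-in that models only 
a compareValues method; it is not Lucene's FieldComparator, and obtaining the 
real instance via newComparator is left out.

```java
import java.util.Comparator;

/** Hypothetical stand-in exposing only the method the facade needs. */
abstract class SketchFieldComparator<T> {
    abstract int compareValues(T first, T second);
}

final class ComparatorFacade {
    /**
     * Wraps a field comparator so it can order untyped, deserialized sort values.
     * The casts are unchecked: callers must pass values of the type the
     * comparator expects, which is what an unmarshal step would guarantee.
     */
    @SuppressWarnings("unchecked")
    static <T> Comparator<Object> of(SketchFieldComparator<T> fieldComparator) {
        return (a, b) -> fieldComparator.compareValues((T) a, (T) b);
    }
}
```

With a facade like this the merge queue would not need a natural ordering, so 
values such as byte[] that are not Comparable could still be ordered.
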



--
This message was sent by Atlassian JIRA
(v6.1#6144)
