Jan,

Thanks a lot for the response.

My application's indexer generates the id from the raw data plus a metadata field that ties that piece of data to its origin. That let me rely on the unique key to ensure uniqueness per origin per row (at least that is what I did before I migrated to a TRA, i.e. a time routed alias). Now, with the rules of collection aliases, I have to make sure myself that a doc wasn't indexed before, which is harder to manage and will affect indexing performance without a doubt.

I really liked your idea of a query-time distinct. I think I can live with my big TRA having some dups across the collections and "hiding" them at query time, but I have two questions:

1) How will using the collapse query parser affect query performance? It sounds to me like it depends on the size of the result set. Is that right?
2) I tried what you suggested on the very same simplified use case and it didn't work for me. It seems that collapse doesn't affect the way Solr calculates the facet counts. Should I add something else?

What I did:

http://localhost:8983/solr/test/select?fq=%7B!collapse%20field%3Did%7D&q=*%3A*&facet=on&facet.field=id

{
  "responseHeader":{
    "zkConnected":true,
    "status":0,
    "QTime":8,
    "params":{
      "q":"*:*",
      "facet.field":"id",
      "fq":"{!collapse field=id}",
      "facet":"on"}},
  "response":{"numFound":1,"start":0,"maxScore":1,"numFoundExact":true,"docs":[
      {
        "id":"123",
        "_version_":1696500688522051600,
        "score":1}]
  },
  "facet_counts":{
    "facet_queries":{},
    "facet_fields":{
      "id":[
        "123",2]},
    "facet_ranges":{},
    "facet_intervals":{},
    "facet_heatmaps":{}}}

BUT! While trying your idea I thought of another one: use a sub-facet on the faceted field and run the unique facet function on that same field, like so:

http://localhost:8983/solr/test/select?&q=*%3A*&json.facet={ids:{type:terms,field:id,facet:{unique_count:%22unique(id)%22}}}

If I add another doc {"id":"abc"} for illustration, I get:

{
  "responseHeader":{
    "zkConnected":true,
    "status":0,
    "QTime":19,
    "params":{
      "q":"*:*",
      "json.facet":"{ids:{type:terms,field:id,facet:{unique_count:\"unique(id)\"}}}"}},
  "response":{"numFound":2,"start":0,"maxScore":1,"numFoundExact":true,"docs":[
      {
        "id":"123",
        "_version_":1696500688522051600},
      {
        "id":"abc",
        "_version_":1696504041626927000}]
  },
  "facets":{
    "count":3,
    "ids":{
      "buckets":[
        {
          "val":"123",
          "count":2,
          "unique_count":1},
        {
          "val":"abc",
          "count":1,
          "unique_count":1}]}}}

I think this can basically solve my issue: I allow dups across the TRA collections and just "ignore" them with this approach, reading unique_count (1 for id 123) instead of count (2).

WDYT? Am I missing something? How do facet functions, and specifically the unique facet function, perform, especially when nested?

Looking forward to reading what you and others think :)

THANKS!

On Thu, Apr 8, 2021 at 15:52, Jan Høydahl < jan....@cominvent.com> wrote:

> You are right - when you want to search across multiple collections,
> whether through alias or explicitly, Solr no longer guarantees the
> uniqueness of IDs for you, as that is only per collection.
> Meaning, you need to enforce ID uniqueness yourself. And if using routed
> aliases, ..."It’s extremely important with all routed aliases that the
> route values NOT change."
>
> So if this is outside your control, the question becomes - are documents
> with same ID really duplicates and should not be counted twice? Or are they
> distinct docs which happen to have same ID?
> If they indeed are duplicates, you may attempt to do duplicate removal in
> your query by e.g. adding fq={!collapse field=id} to your query
>
> Jan
>
> > On 24 Mar 2021, at 18:09, Eran Buchnick <buchni...@gmail.com> wrote:
> >
> > Hi,
> > I've noticed the following warning in the *aliases documentation*:
> > *"...Reindexing a document with a different route value for the same ID*
> > *produces two distinct documents with the same ID accessible via the*
> > *alias..."*
> > When I tested such a case, it seems that really only one doc is retrieved, but
> > when turning on *facets they aren't aligned with the result set.*
> >
> > Expected behavior or bug?
> > If expected - how should I avoid dups and implement upserts without the
> > overhead of preliminary queries?
> > > > My test: > > 1) create two collections test1 and test2 and alias named test for both > > 2) index docs with the same id to both of the collections > > {"id":123} > > 3) querying the alias as followed with explained debug: > > > http://localhost:8983/solr/test/select?debug.explain.structured=true&debugQuery=on&facet.field=id&facet=on&q=*%3A* > > { > > "responseHeader":{ > > "zkConnected":true, > > "status":0, > > "QTime":25, > > "params":{ > > "q":"*:*", > > "facet.field":"id", > > "debug.explain.structured":"true", > > "facet":"on", > > "debugQuery":"on", > > "_":"1616269705741"}}, > > > > "response":{*"numFound":1* > > ,"start":0,"maxScore":1.0,"numFoundExact":true,"docs":[ > > { > > "id":"123", > > "_version_":1694670492462481408}] > > }, > > "facet_counts":{ > > "facet_queries":{}, > > "facet_fields":{ > > *"id":[* > > * "123",2*]}, > > "facet_ranges":{}, > > "facet_intervals":{}, > > "facet_heatmaps":{}}, > > "debug":{ > > "track":{ > > "rid":"-31", > > "EXECUTE_QUERY":{ > > "http://some_ip:8983/solr/test2_shard1_replica_n1/":{ > > "QTime":"3", > > "ElapsedTime":"10", > > "RequestPurpose":"GET_TOP_IDS,GET_FACETS,SET_TERM_STATS", > > "NumFound":"1", > > > > > "Response":"{responseHeader={zkConnected=true,status=0,QTime=3,params={df=_text_,distrib=false,fl=[id, > > score],shards.purpose=16404,fsv=true,shard.url= > > > http://some_ip:8983/solr/test2_shard1_replica_n1/,rid=-31,wt=javabin,_=1616269705741,facet.field=id,f.id.facet.mincount=0,debug=[false > > , > > timing, > > > track],start=0,f.id.facet.limit=160,collection=test1,test2,rows=10,debug.explain.structured=true,version=2,q=*:*,omitHeader=false,requestPurpose=GET_TOP_IDS,GET_FACETS,SET_TERM_STATS,NOW=1616270594521,isShard=true,facet=on,debugQuery=false}},response={numFound=1,numFoundExact=true,start=0,maxScore=1.0,docs=[SolrDocument{id=123, > > > 
score=1.0}]},sort_values={},facet_counts={facet_queries={},facet_fields={id={123=1}},facet_ranges={},facet_intervals={},facet_heatmaps={}},debug={facet-debug={elapse=0,sub-facet=[{processor=SimpleFacets,elapse=0,action=field > > facet,maxThreads=0,sub-facet=[{elapse=0,requestedMethod=not > > > specified,appliedMethod=FC,inputDocSetSize=1,field=id,numBuckets=2}]}]},timing={time=2.0,prepare={time=0.0,query={time=0.0},facet={time=0.0},facet_module={time=0.0},mlt={time=0.0},highlight={time=0.0},stats={time=0.0},expand={time=0.0},terms={time=0.0},debug={time=0.0}},process={time=2.0,query={time=0.0},facet={time=1.0},facet_module={time=0.0},mlt={time=0.0},highlight={time=0.0},stats={time=0.0},expand={time=0.0},terms={time=0.0},debug={time=0.0}}}}}"}, > > "http://some_ip:8983/solr/test1_shard1_replica_n1/":{ > > "QTime":"2", > > "ElapsedTime":"12", > > "RequestPurpose":"GET_TOP_IDS,GET_FACETS,SET_TERM_STATS", > > "NumFound":"1", > > > > > "Response":"{responseHeader={zkConnected=true,status=0,QTime=2,params={df=_text_,distrib=false,fl=[id, > > score],shards.purpose=16404,fsv=true,shard.url= > > > http://some_ip:8983/solr/test1_shard1_replica_n1/,rid=-31,wt=javabin,_=1616269705741,facet.field=id,f.id.facet.mincount=0,debug=[false > > , > > timing, > > > track],start=0,f.id.facet.limit=160,collection=test1,test2,rows=10,debug.explain.structured=true,version=2,q=*:*,omitHeader=false,requestPurpose=GET_TOP_IDS,GET_FACETS,SET_TERM_STATS,NOW=1616270594521,isShard=true,facet=on,debugQuery=false}},response={numFound=1,numFoundExact=true,start=0,maxScore=1.0,docs=[SolrDocument{id=123, > > > score=1.0}]},sort_values={},facet_counts={facet_queries={},facet_fields={id={123=1}},facet_ranges={},facet_intervals={},facet_heatmaps={}},debug={facet-debug={elapse=0,sub-facet=[{processor=SimpleFacets,elapse=0,action=field > > facet,maxThreads=0,sub-facet=[{elapse=0,requestedMethod=not > > > 
specified,appliedMethod=FC,inputDocSetSize=1,field=id,numBuckets=2}]}]},timing={time=2.0,prepare={time=0.0,query={time=0.0},facet={time=0.0},facet_module={time=0.0},mlt={time=0.0},highlight={time=0.0},stats={time=0.0},expand={time=0.0},terms={time=0.0},debug={time=0.0}},process={time=2.0,query={time=0.0},facet={time=1.0},facet_module={time=0.0},mlt={time=0.0},highlight={time=0.0},stats={time=0.0},expand={time=0.0},terms={time=0.0},debug={time=0.0}}}}}"}}, > > "GET_FIELDS":{ > > "http://some_ip:8983/solr/test2_shard1_replica_n1/":{ > > "QTime":"5", > > "ElapsedTime":"8", > > "RequestPurpose":"GET_FIELDS,GET_DEBUG,SET_TERM_STATS", > > "NumFound":"1", > > > > > "Response":"{responseHeader={zkConnected=true,status=0,QTime=5,params={facet.field=id,df=_text_,distrib=false,debug=[timing, > > track],shards.purpose=16704,collection=test1,test2,shard.url= > > > http://some_ip:8983/solr/test2_shard1_replica_n1/,rows=10,rid=-31,debug.explain.structured=true,version=2,q=*:*,omitHeader=false,requestPurpose=GET_FIELDS,GET_DEBUG,SET_TERM_STATS,NOW=1616270594521,ids=123,isShard=true,facet=false,wt=javabin,debugQuery=true,_=1616269705741 > } > > > },response={numFound=1,numFoundExact=true,start=0,docs=[SolrDocument{id=123, > > > _version_=1694670492462481408}]},debug={rawquerystring=*:*,querystring=*:*,parsedquery=MatchAllDocsQuery(*:*),parsedquery_toString=*:*,explain={123={match=true,value=1.0,description=*:*}},QParser=LuceneQParser,timing={time=4.0,prepare={time=0.0,query={time=0.0},facet={time=0.0},facet_module={time=0.0},mlt={time=0.0},highlight={time=0.0},stats={time=0.0},expand={time=0.0},terms={time=0.0},debug={time=0.0}},process={time=4.0,query={time=0.0},facet={time=0.0},facet_module={time=0.0},mlt={time=0.0},highlight={time=0.0},stats={time=0.0},expand={time=0.0},terms={time=0.0},debug={time=4.0}}}}}"}}}, > > "facet-debug":{ > > "elapse":0, > > "sub-facet":[{ > > "processor":"SimpleFacets", > > "elapse":0, > > "action":"field facet", > > "maxThreads":0, > > "sub-facet":[{ 
> > "elapse":0, > > "requestedMethod":"not specified", > > "appliedMethod":"FC", > > "inputDocSetSize":1, > > "field":"id", > > "numBuckets":2}]}]}, > > "timing":{ > > "time":8.0, > > "prepare":{ > > "time":0.0, > > "query":{ > > "time":0.0}, > > "facet":{ > > "time":0.0}, > > "facet_module":{ > > "time":0.0}, > > "mlt":{ > > "time":0.0}, > > "highlight":{ > > "time":0.0}, > > "stats":{ > > "time":0.0}, > > "expand":{ > > "time":0.0}, > > "terms":{ > > "time":0.0}, > > "debug":{ > > "time":0.0}}, > > "process":{ > > "time":8.0, > > "query":{ > > "time":0.0}, > > "facet":{ > > "time":2.0}, > > "facet_module":{ > > "time":0.0}, > > "mlt":{ > > "time":0.0}, > > "highlight":{ > > "time":0.0}, > > "stats":{ > > "time":0.0}, > > "expand":{ > > "time":0.0}, > > "terms":{ > > "time":0.0}, > > "debug":{ > > "time":4.0}}}, > > "rawquerystring":"*:*", > > "querystring":"*:*", > > "parsedquery":"MatchAllDocsQuery(*:*)", > > "parsedquery_toString":"*:*", > > "QParser":"LuceneQParser", > > "explain":{ > > "123":{ > > "match":true, > > "value":1.0, > > "description":"*:*"}}}} > > > > Thanks. > > -- *BR,* *Eran Buchnick*
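
P.S. For anyone who wants to try the nested unique() facet from the top of this thread, here is a minimal Python sketch of it. It only builds the request URL and reads the facets section of a response, and it does not contact Solr. The names BASE_URL, unique_facet_url and deduplicated_counts are hypothetical helpers made up for illustration; the sample buckets are copied from the response quoted above. It says nothing about performance, which is still the open question.

```python
import json
from urllib.parse import urlencode

# Assumption: the "test" alias from the thread, served by a local Solr node.
BASE_URL = "http://localhost:8983/solr/test/select"


def unique_facet_url(field: str = "id") -> str:
    """Build a terms-facet request with a nested unique() aggregation."""
    json_facet = {
        "ids": {
            "type": "terms",
            "field": field,
            # unique(field) counts distinct values of the field in each bucket.
            "facet": {"unique_count": f"unique({field})"},
        }
    }
    params = {
        "q": "*:*",
        "json.facet": json.dumps(json_facet, separators=(",", ":")),
    }
    return f"{BASE_URL}?{urlencode(params)}"


def deduplicated_counts(facets: dict) -> dict:
    """Map each bucket value to unique_count instead of the raw count."""
    return {b["val"]: b["unique_count"] for b in facets["ids"]["buckets"]}


# "facets" section copied from the response shown in the thread.
sample_facets = {
    "count": 3,
    "ids": {
        "buckets": [
            {"val": "123", "count": 2, "unique_count": 1},
            {"val": "abc", "count": 1, "unique_count": 1},
        ]
    },
}

if __name__ == "__main__":
    print(unique_facet_url())
    print(deduplicated_counts(sample_facets))
```

Reading unique_count rather than count is what hides the cross-collection duplicate for id 123 in this sample.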