Jan,
Thanks a lot for the response.
My application's indexer generates the id from the raw data plus
another metadata field that ties that piece of data to its origin.
That lets me rely on the unique key to ensure uniqueness per
origin per row (at least that is what I did before I migrated to TRA).
Now, with the rules for collection aliases, I have to make sure myself that
a doc wasn't indexed before, which is harder to manage and
will certainly hurt indexing performance.
I really liked your idea of a query-time distinct. I think I
can live with my big TRA having some dups across the collections
and "hide" them at query time, but I have two questions:
1) How will the collapse query parser affect query
performance? It sounds to me like it depends on the size of the result set.
Is that right?
2) I tried what you suggested on the very same simplified use case and
it didn't work for me. It seems that collapse doesn't affect the way
Solr calculates the facet counts. Should I add something
else? What I did:
http://localhost:8983/solr/test/select?fq=%7B!collapse%20field%3Did%7D&q=*%3A*&facet=on&facet.field=id
{
  "responseHeader": {
    "zkConnected": true,
    "status": 0,
    "QTime": 8,
    "params": {
      "q": "*:*",
      "facet.field": "id",
      "fq": "{!collapse field=id}",
      "facet": "on"
    }
  },
  "response": {
    "numFound": 1,
    "start": 0,
    "maxScore": 1,
    "numFoundExact": true,
    "docs": [
      {
        "id": "123",
        "_version_": 1696500688522051600,
        "score": 1
      }
    ]
  },
  "facet_counts": {
    "facet_queries": {},
    "facet_fields": {
      "id": [
        "123",
        2
      ]
    },
    "facet_ranges": {},
    "facet_intervals": {},
    "facet_heatmaps": {}
  }
}
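
The per-shard debug output in the original message (quoted further down)
suggests why the two numbers disagree: each collection reports one document
and a facet count of 1 for id 123, while the merged response shows one
document but a facet count of 2. Below is a toy Python model of that
behaviour. It is only a sketch of what the output suggests, not Solr's
actual merge code, and the function and variable names are made up for
illustration.

```python
from collections import Counter

# Hypothetical per-collection results, modelled on the debug output:
# each collection holds one doc with id "123" and reports a facet count of 1.
shard_responses = [
    {"docs": [{"id": "123"}], "facet_id": {"123": 1}},  # test1
    {"docs": [{"id": "123"}], "facet_id": {"123": 1}},  # test2
]

def merge(responses):
    """Toy merge: docs are de-duplicated by id, facet counts are summed."""
    docs_by_id = {}
    facet_counts = Counter()
    for resp in responses:
        for doc in resp["docs"]:
            docs_by_id.setdefault(doc["id"], doc)  # keep the first doc per id
        facet_counts.update(resp["facet_id"])      # add up per-shard counts
    return {"numFound": len(docs_by_id), "facet_id": dict(facet_counts)}

merged = merge(shard_responses)
print(merged)  # {'numFound': 1, 'facet_id': {'123': 2}}
```

If collapse is applied within each collection separately (an assumption
here, though it is consistent with the response above), it would not change
these numbers, because each collection already holds only one doc per id.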

BUT! While trying your idea I thought of another one: use a sub-facet
on the faceted field and apply the unique facet function to the same
field, like so:
http://localhost:8983/solr/test/select?&q=*%3A*&json.facet={ids:{type:terms,field:id,facet:{unique_count:%22unique(id)%22}}}
If I add another doc {"id":"abc"} for illustration, I get:
{
  "responseHeader": {
    "zkConnected": true,
    "status": 0,
    "QTime": 19,
    "params": {
      "q": "*:*",
      "json.facet": "{ids:{type:terms,field:id,facet:{unique_count:\"unique(id)\"}}}"
    }
  },
  "response": {
    "numFound": 2,
    "start": 0,
    "maxScore": 1,
    "numFoundExact": true,
    "docs": [
      {
        "id": "123",
        "_version_": 1696500688522051600
      },
      {
        "id": "abc",
        "_version_": 1696504041626927000
      }
    ]
  },
  "facets": {
    "count": 3,
    "ids": {
      "buckets": [
        {
          "val": "123",
          "count": 2,
          "unique_count": 1
        },
        {
          "val": "abc",
          "count": 1,
          "unique_count": 1
        }
      ]
    }
  }
}
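
For anyone who wants to reproduce the request from a script rather than the
browser, here is a minimal Python sketch that builds the same URL. The base
URL is the "test" alias from this example; the variable names are only
illustrative. It sends strict JSON in json.facet, which should be accepted
just like the relaxed form used in the URL above.

```python
import json
from urllib.parse import urlencode

# Assumed base URL: the "test" alias used in the example above.
BASE_URL = "http://localhost:8983/solr/test/select"

# Terms facet on id, with unique(id) as a nested aggregation per bucket.
json_facet = {
    "ids": {
        "type": "terms",
        "field": "id",
        "facet": {"unique_count": "unique(id)"},
    }
}

# urlencode takes care of percent-encoding the braces, quotes and colons.
params = {"q": "*:*", "json.facet": json.dumps(json_facet)}
url = BASE_URL + "?" + urlencode(params)
print(url)
```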
I think this basically solves my issue: I allow dups
across the TRA collections and just "ignore" them with this approach.
WDYT? Am I missing something? How do facet functions, and specifically the
unique facet function, perform, especially when
nested?
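
To make the "ignoring" step concrete, here is a small Python sketch of how a
client could read the facets section shown above. The summarize helper is
hypothetical, and the input is just the sample response from this message
typed in by hand.

```python
# The "facets" section of the response shown above.
facets = {
    "count": 3,
    "ids": {
        "buckets": [
            {"val": "123", "count": 2, "unique_count": 1},
            {"val": "abc", "count": 1, "unique_count": 1},
        ]
    },
}

def summarize(facets):
    """Return (raw doc count, distinct ids, ids that occur more than once)."""
    buckets = facets["ids"]["buckets"]
    raw = sum(b["count"] for b in buckets)
    distinct = sum(b["unique_count"] for b in buckets)
    duplicated = [b["val"] for b in buckets if b["count"] > b["unique_count"]]
    return raw, distinct, duplicated

print(summarize(facets))  # (3, 2, ['123'])
```

Two caveats with this sketch. A terms facet returns only a limited number of
buckets by default, so the sums cover just the buckets that came back. And
because the terms facet and unique() are on the same field here, each
bucket's unique_count is 1 by construction; the useful signal is count
being larger than unique_count. If only the overall distinct count is
needed, a top-level aggregation such as json.facet={distinct_ids:"unique(id)"}
(the key name distinct_ids is made up) may be enough.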

Looking forward to hearing what you and others think :)

THANKS!

On Thu, Apr 8, 2021 at 15:52, Jan Høydahl <jan....@cominvent.com> wrote:

> You are right - when you want to search across multiple collections,
> whether through alias or explicitly, Solr no longer guarantees the
> uniqueness of IDs for you, as that is only per collection.
> Meaning, you need to enforce ID uniqueness yourself. And if using routed
> aliases, ..."It’s extremely important with all routed aliases that the
> route values NOT change."
>
> So if this is outside your control, the question becomes - are documents
> with same ID really duplicates and should not be counted twice? Or are they
> distinct docs which happen to have same ID?
> If they indeed are duplicates, you may attempt to do duplicate removal in
> your query by e.g. adding fq={!collapse field=id} to your query
>
> Jan
>
> > 24. mar. 2021 kl. 18:09 skrev Eran Buchnick <buchni...@gmail.com>:
> >
> > Hi,
> > I've noticed the following warning in the *aliases documentation*:
> > *"...Reindexing a document with a different route value for the same ID*
> > *produces two distinct documents with the same ID accessible via the*
> > *alias..."*
> > When I tested such a case, it seems that only one doc is actually
> > retrieved, but when facets are turned on *they aren't aligned with the
> > result set.*
> >
> > Expected behavior or bug?
> > If expected - how should I avoid dups and implement upserts without the
> > overhead of preliminary queries?
> >
> > My test:
> > 1) create two collections test1 and test2 and alias named test for both
> > 2) index docs with the same id to both of the collections
> > {"id":123}
> > 3) query the alias as follows, with debug explain enabled:
> >
> http://localhost:8983/solr/test/select?debug.explain.structured=true&debugQuery=on&facet.field=id&facet=on&q=*%3A*
> > {
> >  "responseHeader":{
> >    "zkConnected":true,
> >    "status":0,
> >    "QTime":25,
> >    "params":{
> >      "q":"*:*",
> >      "facet.field":"id",
> >      "debug.explain.structured":"true",
> >      "facet":"on",
> >      "debugQuery":"on",
> >      "_":"1616269705741"}},
> >
> > "response":{*"numFound":1*
> > ,"start":0,"maxScore":1.0,"numFoundExact":true,"docs":[
> >      {
> >        "id":"123",
> >        "_version_":1694670492462481408}]
> >  },
> >  "facet_counts":{
> >    "facet_queries":{},
> >    "facet_fields":{
> >      *"id":[*
> > *        "123",2*]},
> >    "facet_ranges":{},
> >    "facet_intervals":{},
> >    "facet_heatmaps":{}},
> >  "debug":{
> >    "track":{
> >      "rid":"-31",
> >      "EXECUTE_QUERY":{
> >        "http://some_ip:8983/solr/test2_shard1_replica_n1/":{
> >          "QTime":"3",
> >          "ElapsedTime":"10",
> >          "RequestPurpose":"GET_TOP_IDS,GET_FACETS,SET_TERM_STATS",
> >          "NumFound":"1",
> >
> >
> "Response":"{responseHeader={zkConnected=true,status=0,QTime=3,params={df=_text_,distrib=false,fl=[id,
> > score],shards.purpose=16404,fsv=true,shard.url=
> >
> http://some_ip:8983/solr/test2_shard1_replica_n1/,rid=-31,wt=javabin,_=1616269705741,facet.field=id,f.id.facet.mincount=0,debug=[false
> > ,
> > timing,
> >
> track],start=0,f.id.facet.limit=160,collection=test1,test2,rows=10,debug.explain.structured=true,version=2,q=*:*,omitHeader=false,requestPurpose=GET_TOP_IDS,GET_FACETS,SET_TERM_STATS,NOW=1616270594521,isShard=true,facet=on,debugQuery=false}},response={numFound=1,numFoundExact=true,start=0,maxScore=1.0,docs=[SolrDocument{id=123,
> >
> score=1.0}]},sort_values={},facet_counts={facet_queries={},facet_fields={id={123=1}},facet_ranges={},facet_intervals={},facet_heatmaps={}},debug={facet-debug={elapse=0,sub-facet=[{processor=SimpleFacets,elapse=0,action=field
> > facet,maxThreads=0,sub-facet=[{elapse=0,requestedMethod=not
> >
> specified,appliedMethod=FC,inputDocSetSize=1,field=id,numBuckets=2}]}]},timing={time=2.0,prepare={time=0.0,query={time=0.0},facet={time=0.0},facet_module={time=0.0},mlt={time=0.0},highlight={time=0.0},stats={time=0.0},expand={time=0.0},terms={time=0.0},debug={time=0.0}},process={time=2.0,query={time=0.0},facet={time=1.0},facet_module={time=0.0},mlt={time=0.0},highlight={time=0.0},stats={time=0.0},expand={time=0.0},terms={time=0.0},debug={time=0.0}}}}}"},
> >        "http://some_ip:8983/solr/test1_shard1_replica_n1/":{
> >          "QTime":"2",
> >          "ElapsedTime":"12",
> >          "RequestPurpose":"GET_TOP_IDS,GET_FACETS,SET_TERM_STATS",
> >          "NumFound":"1",
> >
> >
> "Response":"{responseHeader={zkConnected=true,status=0,QTime=2,params={df=_text_,distrib=false,fl=[id,
> > score],shards.purpose=16404,fsv=true,shard.url=
> >
> http://some_ip:8983/solr/test1_shard1_replica_n1/,rid=-31,wt=javabin,_=1616269705741,facet.field=id,f.id.facet.mincount=0,debug=[false
> > ,
> > timing,
> >
> track],start=0,f.id.facet.limit=160,collection=test1,test2,rows=10,debug.explain.structured=true,version=2,q=*:*,omitHeader=false,requestPurpose=GET_TOP_IDS,GET_FACETS,SET_TERM_STATS,NOW=1616270594521,isShard=true,facet=on,debugQuery=false}},response={numFound=1,numFoundExact=true,start=0,maxScore=1.0,docs=[SolrDocument{id=123,
> >
> score=1.0}]},sort_values={},facet_counts={facet_queries={},facet_fields={id={123=1}},facet_ranges={},facet_intervals={},facet_heatmaps={}},debug={facet-debug={elapse=0,sub-facet=[{processor=SimpleFacets,elapse=0,action=field
> > facet,maxThreads=0,sub-facet=[{elapse=0,requestedMethod=not
> >
> specified,appliedMethod=FC,inputDocSetSize=1,field=id,numBuckets=2}]}]},timing={time=2.0,prepare={time=0.0,query={time=0.0},facet={time=0.0},facet_module={time=0.0},mlt={time=0.0},highlight={time=0.0},stats={time=0.0},expand={time=0.0},terms={time=0.0},debug={time=0.0}},process={time=2.0,query={time=0.0},facet={time=1.0},facet_module={time=0.0},mlt={time=0.0},highlight={time=0.0},stats={time=0.0},expand={time=0.0},terms={time=0.0},debug={time=0.0}}}}}"}},
> >      "GET_FIELDS":{
> >        "http://some_ip:8983/solr/test2_shard1_replica_n1/":{
> >          "QTime":"5",
> >          "ElapsedTime":"8",
> >          "RequestPurpose":"GET_FIELDS,GET_DEBUG,SET_TERM_STATS",
> >          "NumFound":"1",
> >
> >
> "Response":"{responseHeader={zkConnected=true,status=0,QTime=5,params={facet.field=id,df=_text_,distrib=false,debug=[timing,
> > track],shards.purpose=16704,collection=test1,test2,shard.url=
> >
> http://some_ip:8983/solr/test2_shard1_replica_n1/,rows=10,rid=-31,debug.explain.structured=true,version=2,q=*:*,omitHeader=false,requestPurpose=GET_FIELDS,GET_DEBUG,SET_TERM_STATS,NOW=1616270594521,ids=123,isShard=true,facet=false,wt=javabin,debugQuery=true,_=1616269705741
> }
> >
> },response={numFound=1,numFoundExact=true,start=0,docs=[SolrDocument{id=123,
> >
> _version_=1694670492462481408}]},debug={rawquerystring=*:*,querystring=*:*,parsedquery=MatchAllDocsQuery(*:*),parsedquery_toString=*:*,explain={123={match=true,value=1.0,description=*:*}},QParser=LuceneQParser,timing={time=4.0,prepare={time=0.0,query={time=0.0},facet={time=0.0},facet_module={time=0.0},mlt={time=0.0},highlight={time=0.0},stats={time=0.0},expand={time=0.0},terms={time=0.0},debug={time=0.0}},process={time=4.0,query={time=0.0},facet={time=0.0},facet_module={time=0.0},mlt={time=0.0},highlight={time=0.0},stats={time=0.0},expand={time=0.0},terms={time=0.0},debug={time=4.0}}}}}"}}},
> >    "facet-debug":{
> >      "elapse":0,
> >      "sub-facet":[{
> >          "processor":"SimpleFacets",
> >          "elapse":0,
> >          "action":"field facet",
> >          "maxThreads":0,
> >          "sub-facet":[{
> >              "elapse":0,
> >              "requestedMethod":"not specified",
> >              "appliedMethod":"FC",
> >              "inputDocSetSize":1,
> >              "field":"id",
> >              "numBuckets":2}]}]},
> >    "timing":{
> >      "time":8.0,
> >      "prepare":{
> >        "time":0.0,
> >        "query":{
> >          "time":0.0},
> >        "facet":{
> >          "time":0.0},
> >        "facet_module":{
> >          "time":0.0},
> >        "mlt":{
> >          "time":0.0},
> >        "highlight":{
> >          "time":0.0},
> >        "stats":{
> >          "time":0.0},
> >        "expand":{
> >          "time":0.0},
> >        "terms":{
> >          "time":0.0},
> >        "debug":{
> >          "time":0.0}},
> >      "process":{
> >        "time":8.0,
> >        "query":{
> >          "time":0.0},
> >        "facet":{
> >          "time":2.0},
> >        "facet_module":{
> >          "time":0.0},
> >        "mlt":{
> >          "time":0.0},
> >        "highlight":{
> >          "time":0.0},
> >        "stats":{
> >          "time":0.0},
> >        "expand":{
> >          "time":0.0},
> >        "terms":{
> >          "time":0.0},
> >        "debug":{
> >          "time":4.0}}},
> >    "rawquerystring":"*:*",
> >    "querystring":"*:*",
> >    "parsedquery":"MatchAllDocsQuery(*:*)",
> >    "parsedquery_toString":"*:*",
> >    "QParser":"LuceneQParser",
> >    "explain":{
> >      "123":{
> >        "match":true,
> >        "value":1.0,
> >        "description":"*:*"}}}}
> >
> > Thanks.
>
>

-- 
*BR,*

*Eran Buchnick*
