Re: Fewer results when duplicating phrase in a query with OR

2021-07-02 Thread Jan Høydahl
Which query parser are you using, and what is the config of your search
handler? I suspect there is some implicit phrase slop (ps) going on.
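If in doubt, debug=query shows exactly what each query is parsed into (including any slop).
A minimal sketch, assuming the default /select handler; the URL is a placeholder:

import requests

SOLR_URL = "http://localhost:8983/solr/mycollection/select"  # placeholder

for q in ['"Wolfgang Amadeus Mozart"',
          '"Wolfgang Amadeus Mozart" OR "Wolfgang Amadeus Mozart"']:
    # rows=0 keeps the response small; debug=query adds the parsed query to the response
    resp = requests.get(SOLR_URL, params={"q": q, "rows": 0, "debug": "query"}).json()
    print(resp["response"]["numFound"], resp["debug"]["parsedquery"])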

Jan

> On 1 Jul 2021, at 17:10, Mónica Marrero wrote:
> 
> Hi,
> 
> I am using Solr 7.7 in Cloud with the default query parser and similarity
> algorithm. I get the following results with these queries:
> 
> q= "Wolfgang Amadeus Mozart": 8834 results.
> q= "Wolfgang Amadeus Mozart" OR "Wolfgang Amadeus Mozart": 8831 results.
> 
> To my surprise, I get 3 fewer results with the second query, and I have
> seen that those 3 documents contain the same words in a different order
> ("Mozart Wolfgang Amadeus").
> 
> In case it is relevant, the field used for the query is a textual field,
> with regular normalization (see below):
> 
> [fieldType definition mangled by the mail archive: a text-field analyzer chain
> of which only positionIncrementGap="100" and splitOnNumerics="0" on the
> index- and query-time tokenizers are recoverable]
> 
> 
> Does anybody know why this is happening?
> 
> Thanks in advance for your help.
> 
> Mónica
> 



json request api unexpected error message

2021-07-02 Thread Szűcs Roland
Dear SOLR users,

I tried the following JSON request in Python:
json = {
    "params": {
        "q": "Roy",
        "defType": "edismax",
        "qf": "name_suggest brand_suggest",
    }
}

requests.post(COLLECTION + '/autocomplete', json=json).json()
The result was in line with my expectations. I know the JSON Request API
supports 'query' as a top-level parameter instead of 'q', but I just wanted
to see how this form works.
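For completeness, the same request written with the top-level query parameter would look
roughly like this (a sketch; /autocomplete and the qf fields come from my setup, and it
reuses the requests and COLLECTION definitions above):

body = {
    "query": "Roy",                  # top-level JSON Request API parameter
    "params": {                      # parser-specific parameters still go under params
        "defType": "edismax",
        "qf": "name_suggest brand_suggest",
    },
}
requests.post(COLLECTION + '/autocomplete', json=body).json()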

When I add one more parameter, giving this updated JSON:
json = {
    "params": {
        "q": "Roy",
        "defType": "edismax",
        "qf": "name_suggest brand_suggest",
        "boost": get_boost_text(profile),
    }
}
I get a weird, seemingly unrelated error message:

"no field name specified in query and no default specified via 'df' param".

It is strange, as qf is defined in both versions, yet one request fails and
the other does not.
get_boost_text returns the following string:

'query({!v=" all_text:One^167.7 all_text:Macska^142.1
all_text:felnőtt^106.4 all_text:Cat Vital^58.9 all_text:Profine^40.4
all_text:Royal Canin^24.2 all_text:Purina^21.5
all_text:ivartalanított^21.2 all_text:fajtatáp^10.2
 all_text:Prevital^10.1 name:búza^266.4 name:lazac^103.6
name:macska^102.3 name:cat^60.9 name:profine^40.4 name:10kg^38.6
name:kitten^27.6 name:canin^24.2 name:fhn^23.4 name:chow^21.5
description:bifensis^366.0 description:jövő^307.9
 description:ivartalanítot^266.4 description:egészséges^233.5
description:is^232.4 description:cat^228.8 description:fellépő^225.2
description:alkalmasak.^224.4 description:erőszakmentes^224.4
description:gyököket.bőr^224.4 "})'.

Any idea why adding the boost parameter makes the query invalid, with an error
message seemingly unrelated to the boost?
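One workaround I am considering, though I have not verified it yet, is to pass the generated
subquery as its own request parameter and reference it from boost via parameter dereferencing,
so the long string never has to sit inside the {!v="..."} local params. A rough sketch; the
parameter name boostq and the helper get_boost_terms are made up for illustration:

json = {
    "params": {
        "q": "Roy",
        "defType": "edismax",
        "qf": "name_suggest brand_suggest",
        # boostq should contain only the boosted terms (all_text:One^167.7 ...),
        # without the surrounding query({!v="..."}) wrapper that get_boost_text adds
        "boostq": get_boost_terms(profile),
        "boost": "query($boostq)",   # dereference the separate parameter
    }
}
requests.post(COLLECTION + '/autocomplete', json=json).json()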

Thanks in advance,

Roland


Re: Fewer results when duplicating phrase in a query with OR

2021-07-02 Thread Mónica Marrero
Hi all,

It was completely my fault: I forgot about elevation, and that was the reason
one of the queries returned a different number of results. We can also see it
in debug mode, because then we have a ConstantScore. Thank you so much for your
answers, and sorry for taking your time.
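For anyone who runs into the same confusion: a quick way to confirm that elevation is
involved is to re-run the query with elevation disabled and compare the counts. A minimal
sketch, assuming the handler has the QueryElevationComponent enabled; the URL is a placeholder:

import requests

SOLR_URL = "http://localhost:8983/solr/mycollection/select"  # placeholder
q = '"Wolfgang Amadeus Mozart" OR "Wolfgang Amadeus Mozart"'

with_elevation = requests.get(SOLR_URL, params={"q": q, "rows": 0}).json()
without_elevation = requests.get(SOLR_URL, params={"q": q, "rows": 0,
                                                   "enableElevation": "false"}).json()
# if the counts differ, elevation is changing the result set
print(with_elevation["response"]["numFound"],
      without_elevation["response"]["numFound"])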

Cheers,

Mónica

On Fri, 2 Jul 2021 at 09:42, Jan Høydahl  wrote:

> Which query parser are you using, and what is the config of your search
> handler? I suspect there is some implicit phrase slop (ps) going on.
>
> Jan
>
> > On 1 Jul 2021, at 17:10, Mónica Marrero wrote:
> >
> > Hi,
> >
> > I am using Solr 7.7 in Cloud with the default query parser and similarity
> > algorithm. I get the following results with these queries:
> >
> > q= "Wolfgang Amadeus Mozart": 8834 results.
> > q= "Wolfgang Amadeus Mozart" OR "Wolfgang Amadeus Mozart": 8831 results.
> >
> > To my surprise, I get 3 fewer results with the second query, and I have
> > seen that those 3 documents contain the same words in a different order
> > ("Mozart Wolfgang Amadeus").
> >
> > In case it is relevant, the field used for the query is a textual field,
> > with regular normalization (see below):
> >
> > [fieldType definition mangled by the mail archive: a text-field analyzer
> > chain of which only positionIncrementGap="100" and splitOnNumerics="0" on
> > the index- and query-time tokenizers are recoverable]
> >
> >
> > Does anybody know why this is happening?
> >
> > Thanks in advance for your help.
> >
> > Mónica
> >
>
>



Re: HTTPSolrClient - help required [Singleton Recommended?]

2021-07-02 Thread Vincenzo D'Amore
The SolrClients are thread safe, so yes, I recommend using a single instance
for the whole life of your application (as said, I prefer to have one instance
per index/collection).
If a network problem occurs, the client will reconnect to the server
automatically.
Regarding the defaults, though, I would specify the connection and socket
timeouts when creating the CloudSolrClient (setConnectionTimeout, setSoTimeout).
Pay attention: the timeouts apply to both ZooKeeper and Solr, just to be sure
the case of a hanging network is handled.

On Fri, Jul 2, 2021 at 2:36 AM Reej M  wrote:

> Hi Shawn / Team ,
> Need a suggestion on using the cloudsolrclient.
> In our application, we have a few cores that are indexed every few minutes
> (starting from 15-minute intervals), and searching is done by users at the
> same time. Is it recommended to maintain a single CloudSolrClient throughout
> the application, something like a singleton? I am afraid that in a
> multi-threaded environment one thread might hold up processing until another
> completes. Kindly advise.
>
> Thanks
> Reej
>
> > On 30 Jun 2021, at 2:13 PM, Reej M  wrote:
> >
> > Oh ok Walter.
> > For the moment, we too cannot update to cloudsolrclient, and we are
> trying to find a way to resume the connections for now, and later work on
> the code cleanup. Thanks
> >
> >> On 30 Jun 2021, at 12:49 AM, Walter Underwood 
> wrote:
> >>
> >> CloudSolrClient is not an absolute requirement for a Solr Cloud
> cluster.
> >>
> >> We use regular HTTPSolrClient sending all requests to the load balancer.
> >> Actually, we use a separate load balancer for indexing, to keep the
> monitoring
> >> separate and to set different timeouts than for queries.
> >>
> >> This setup is simple and fast. With our biggest cluster, we index about
> a
> >> half million documents per minute.
> >>
> >> wunder
> >> Walter Underwood
> >> wun...@wunderwood.org
> >> http://observer.wunderwood.org/  (my blog)
> >>
> >>> On Jun 29, 2021, at 9:37 AM, Reej Nayagam  wrote:
> >>>
> >>> Thanks Shawn & Vicenzo. Will check it out and change accordingly.
> Thanks
> >>> again Shawn for your clear explanation.
> >>>
> >>> Regards
> >>> Reej
> >>>
> >>>
> >>> On Tue, 29 Jun 2021 at 9:47 PM, Vincenzo D'Amore 
> wrote:
> >>>
>  Right, you should always use CloudSolrClient as a singleton.
>  To be honest I'm used to reuse a CloudSolrClient instance for each
>  collection/index.
> 
>  On Tue, Jun 29, 2021 at 3:12 PM Shawn Heisey 
> wrote:
> 
> > On 6/29/2021 6:43 AM, Reej Nayagam wrote:
> >> Hi Vincenzo, yes, we are using SolrCloud. The initial Solr version was
> >> 4.10.4, and we upgraded only the jars to 8.8.2 on the application side
> >> (the part connecting to the Solr server) to fix a vulnerability. Since we
> >> upgraded the jars, we changed the HttpSolrServer connection to
> >> HttpSolrClient, and we suspect there is a connection leak; we wanted to
> >> check whether we need to close it or whether that is handled internally.
> >> I am not sure I can use a singleton, as the base URL changes depending on
> >> the leader.
> >
> > If you're running SolrCloud, you should be using CloudSolrClient, not
> > HttpSolrClient.  The cloud client talks to zookeeper, so it is always
> > aware of the cluster state -- it will be aware of down servers and
> new
> > servers, without recreating or reinitializing the client.  And it
> will
> > be aware of changes instantly -- because that information is
> coordinated
> > in zookeeper.
> >
> > You can use a single client object for multiple collections.  All of
> the
> > methods that execute requests should have a version where you can
> pass
> > it the name of the collection/core you want to operate on. For
> > CloudSolrClient, you point it at all your ZK servers and it figures
> out
> > the Solr server URLs  from the clusterstate in ZK.  For
> HttpSolrClient,
> > you just leave the collection name off of the base URL --
> > "http://server.example.com:8983/solr"; is an example URL.
> >
> > As was mentioned, you should create a client during program startup
> and
> > then use it to handle all requests for the life of the program. It
> > should manage connections and close them after receiving data, with
> no
> > coding needed from the developer (you).  If you close a SolrClient
> > object, it will not function after that.  If you're having
> connections
> > stay open, then either you're running a Solr or SolrJ version with
> bugs,
> > or there is something wrong with your networking.
> >
> > It shouldn't be necessary to ever close a SolrClient object, unless
> you
> > create a new one every time your program talks to Solr.   Which you
> > shouldn't do.
> >
> >
> > Thanks,
> > Shawn
> >
> >
> 
>  --
>  Vincenzo D'Amore
> 
> >>> --
> >>> *Thanks,*
> >>> *Reej*
> >>
> >
>
>

-- 
Vincenzo D'Amore


Oom in SolrIndexSearcher.buildTopDocsCollector

2021-07-02 Thread Ma, Samuel
Hi, Solr users

I have an OOM issue in SolrIndexSearcher.buildTopDocsCollector in Solr 7.7.2.
The relevant part of the stack trace is below:

at org.apache.lucene.search.FieldComparator$LongComparator.<init>(FieldComparator.java:359) ~[?:?]
at org.apache.lucene.search.SortField.getComparator(SortField.java:354) ~[?:?]
at org.apache.lucene.search.FieldValueHitQueue.<init>(FieldValueHitQueue.java:140) ~[?:?]
at org.apache.lucene.search.FieldValueHitQueue.<init>(FieldValueHitQueue.java:32) ~[?:?]
at org.apache.lucene.search.FieldValueHitQueue$OneComparatorFieldValueHitQueue.<init>(FieldValueHitQueue.java:62) ~[?:?]
at org.apache.lucene.search.FieldValueHitQueue.create(FieldValueHitQueue.java:163) ~[?:?]
at org.apache.lucene.search.TopFieldCollector.create(TopFieldCollector.java:557) ~[?:?]
at org.apache.solr.search.SolrIndexSearcher.buildTopDocsCollector(SolrIndexSearcher.java:1526) ~[?:?]
at org.apache.solr.search.SolrIndexSearcher.getDocListNC(SolrIndexSearcher.java:1602) ~[?:?]
at org.apache.solr.search.SolrIndexSearcher.getDocListC(SolrIndexSearcher.java:1421) ~[?:?]
at org.apache.solr.search.SolrIndexSearcher.search(SolrIndexSearcher.java:568) ~[?:?]
at org.apache.solr.handler.component.QueryComponent.doProcessUngroupedSearch(QueryComponent.java:1435) ~[?:?]
at org.apache.solr.handler.component.QueryComponent.process(QueryComponent.java:375) ~[?:?]
at org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:298) ~[?:?]
at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:199) ~[?:?]
…

When I attach the Arthas tool to Solr to monitor this (watch
org.apache.solr.search.SolrIndexSearcher buildTopDocsCollector
'{params, target.registryName}' "params[0]>500 and params[1].sort != null" -x 3 -b),
I can catch the case below:

method=org.apache.solr.search.SolrIndexSearcher.buildTopDocsCollector 
location=AtEnter
ts=2021-07-01 08:29:35; [cost=0.003948ms] result=@ArrayList[
@Object[][
@Integer[6032340],
@QueryCommand[
query=@BooleanQuery[(xptLabels_fr:*)^5.0 (xptLabels_bsID:*)^5.0 
(xptLabels_bs:*)^5.0 (xptName:*)^10.0 (xptLabels_br:*)^5.0 
(xptLabels_enGB:*)^5.0 (xptLabels_deCH:*)^5.0 (xptLabels:*)^5.0 
(xptLabels_ro:*)^5.0 (xptDescription:*)^5.0 (xptLabels_ja:*)^5.0 
(xptLabels_bg:*)^5.0 (xptLabels_fi:*)^5.0 (xptLabels_ru:*)^5.0 
(xptLabels_no:*)^5.0 (xptLabels_nl:*)^5.0 (xptLabels_it:*)^5.0 
(xptLabels_el:*)^5.0 (xptLabels_nb:*)^5.0 (xptLabels_vi:*)^5.0 
(xptLabels_ar:*)^5.0 (xptLabels_es:*)^5.0 (xptLabels_iw:*)^5.0 
(xptLabels_frCA:*)^5.0 (xptLabels_uk:*)^5.0 (xptLabels_enRTL:*)^5.0 
(xptLabels_hu:*)^5.0 (xptLabels_hr:*)^5.0 (xptLabels_lt:*)^5.0 
(xptAuthorName:*)^5.0 (xptLabels_da:*)^5.0 (xptLabels_pl:*)^5.0 
(xptLabels_cy:*)^5.0 (xptLabels_pt:*)^5.0 (xptLabels_tw:*)^5.0 
(xptLabels_de:*)^5.0 (xptLabels_hi:*)^5.0 (xptLabels_tr:*)^5.0 
(xptLabels_cn:*)^5.0 (xptLabels_esMX:*)^5.0 (xptLabels_th:*)^5.0 
(xptLabels_cs:*)^5.0 (xptLabels_sl:*)^5.0 (xptLabels_sk:*)^5.0 
(xptLabels_ko:*)^5.0 (xptLabels_svSE:*)^5.0 (xptLabels_sv:*)^5.0 
(xptLabels_tlPH:*)^5.0 (xptLabels_sr:*)^5.0 (xptLabels_ca:*)^5.0],
filterList=@ArrayList[isEmpty=false;size=3],
filter=null,
sort=@Sort[!],
offset=@Integer[0],
len=@Integer[2147483647],
supersetMaxDoc=@Integer[6032340],
flags=@Integer[1],
timeAllowed=@Long[-1],
cursorMark=null,
],
],
@String[solr.core.xxxbT1.shard1.replica_n11],
]

What I do not understand is why len in the QueryCommand (the second parameter
of SolrIndexSearcher.buildTopDocsCollector) is the maximum integer value,
2147483647. Can you help me? Thank you!
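My only guess so far is that 2147483647 is Integer.MAX_VALUE, i.e. something is effectively
asking for an unbounded number of rows, which makes the sort collector pre-allocate a huge
hit queue. I have not confirmed that this is what happens here. If the goal is to page
through all documents, a cursorMark loop keeps the per-request collector small; a minimal
sketch with a placeholder URL and "id" as the uniqueKey field:

import requests

SOLR_URL = "http://localhost:8983/solr/mycollection/select"  # placeholder
params = {
    "q": "*:*",
    "rows": 1000,                  # small, bounded collector per request
    "sort": "score desc, id asc",  # cursorMark needs a tie-break on the uniqueKey
    "cursorMark": "*",
}
while True:
    resp = requests.get(SOLR_URL, params=params).json()
    docs = resp["response"]["docs"]
    # ... process docs ...
    if resp["nextCursorMark"] == params["cursorMark"]:
        break                      # last page reached
    params["cursorMark"] = resp["nextCursorMark"]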

--
Samuel MA
SF Platform Service
Tel: +86 13818555346




Re: Aligning Shards from different Collections on the same Solr server based on Date Range

2021-07-02 Thread Matt Kuiper
After some research, it appears the following approach may help in this
situation and remove the requirement of collocating indexes for joins. One
drawback appears to be the limited set of field types supported for the join
field.

https://solr.apache.org/guide/8_8/other-parsers.html#cross-collection-join
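For reference, the request would look roughly like the sketch below (collection names,
field names, and the URL are placeholders, and I have not tried this against my own
schema yet):

import requests

SOLR_URL = "http://localhost:8983/solr/collectionA/select"  # placeholder

params = {
    # Cross-collection join: filter collectionA by a query run against collectionB,
    # matching collectionB.join_key to collectionA.join_key (placeholder field names).
    "q": '{!join method="crossCollection" fromIndex="collectionB" '
         'from="join_key" to="join_key"}date_field:[2020-01-01T00:00:00Z TO *]',
    "rows": 10,
}
print(requests.get(SOLR_URL, params=params).json()["response"]["numFound"])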

Matt

On Wed, Jun 30, 2021 at 11:59 AM Matt Kuiper  wrote:

> Hi Solr Group,
>
> I am not sure the following is a viable use case; I welcome input and any
> implementation recommendations.
>
> I would like to perform joins over two sharded collections, where documents
> are routed to specific shards based on a date range, and the date ranges are
> the same for the shards in each collection.
>
> I understand that this means the replicas from each collection that hold
> the data to be joined need to be collocated on the same Solr server. I have
> read solutions that use ADDREPLICA to add a Collection B replica to all Solr
> servers, assuming Collection B has only one shard. For my use case, I need
> Collection B to have multiple shards.
>
> *Collection A   Collection B   SolrServer*
> Shard1_2020     Shard1_2020    172.33.0.1:8983_solr
> Shard2_2021     Shard2_2021    172.33.0.2:8983_solr
> Shard3_2022     Shard3_2022    172.33.0.3:8983_solr
>
> I think my question comes down to: how do I split shards by date range, in
> a way that Collections A and B are defined by the same date ranges? If I
> could reliably split shards by date, and knew the date range of each shard,
> I think I could use the ADDREPLICA API to align them.
>
> I am not sure a compositeId routing approach would work, but I think
> implicit routing may be hard to manage over time.
>
> Is an approach like this viable? I am a bit concerned about maintenance.
> Are there other ideas to support this join?
>
> Note: I am considering this within Time series collections...
>
> Matt
>