Re: Urgent help needed on Solr cloud(cont)

2021-06-22 Thread SayantiGmail


Hi

This seems to be a bug in Solr 8.4. Will this get resolved in higher versions,
or do we need to update the stall time configuration as a workaround?
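
If the workaround route is the way to go, my understanding (from the linked VLO
issue - we have not verified this ourselves) is that solr.cloud.client.stallTime
is read as a JVM system property, so it could be raised in solr.in.sh, e.g.
(the value is just an example):

  SOLR_OPTS="$SOLR_OPTS -Dsolr.cloud.client.stallTime=120000"

We are also assuming the companion suggestion below - relying on autoCommit and
autoSoftCommit in solrconfig.xml instead of sending explicit commit commands -
would look roughly like this, with the maxTime values still to be tuned for our
load:

  <updateHandler class="solr.DirectUpdateHandler2">
    <autoCommit>
      <maxTime>60000</maxTime>            <!-- hard commit: flush segments to disk -->
      <openSearcher>false</openSearcher>  <!-- do not open a new searcher on hard commit -->
    </autoCommit>
    <autoSoftCommit>
      <maxTime>30000</maxTime>            <!-- soft commit: controls visibility of new docs -->
    </autoSoftCommit>
  </updateHandler>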

> On 12 Apr 2021, at 22:41, Carlos .Sponchiado  wrote:
> 
> 1. I found a similar issue in this version of Solr here
> https://github.com/clarin-eric/VLO/issues/291 , They suggest using
> solr.cloud.client.stallTime
> to mitigate it. But I think fixing the commit issue will solve this problem
> too.
> 4. The delete query was supposed to only mark the document inside each
> segment as deleted.
> 
> Do you know if the throughput of document updates has increased a lot? In the
> stats in the Solr Admin UI it is possible to see how frequently commits are
> happening. If they are very frequent, avoiding explicit commit commands from
> the client and relying on configured hardCommit and softCommit intervals can
> help you. Let's wait for other suggestions here too.
> 
>> On Mon, 12 Apr 2021 at 18:51, Rekha Sekhar 
>> wrote:
>> 
>> Hi,
>> 
>> Thank you for the information. We will try out the suggestions below.
>> 
>> I have a few more questions about issues we are facing in the application.
>> 
>> The application has *2 Solr (v8.4.1)* nodes and *3 ZooKeeper (v3.6.2)* nodes
>> running in SolrCloud mode.
>> After running it for a few days we see the errors below in the logs:
>> 2021-04-12 11:04:26.850 ERROR (qtp1632497828-1072) [c:datacore s:shard1
>> r:core_node4 x:datacore_shard1_replica_n2] o.a.s.u.SolrCmdDistributor
>> java.io.IOException: *Request processing has stalled for 90083ms with 100
>> remaining elements in the queue*.
>> and
>> 2021-04-12 09:00:36.350 ERROR (qtp1632497828-786) [c:datacore s:shard1
>> r:core_node4 x:datacore_shard1_replica_n2] o.a.s.s.HttpSolrCall
>> null:java.io.IOException: *Task queue processing has stalled for 90175 ms
>> with 92 remaining elements to process.*
>> 
>> 
>> 1. What do these error messages mean? How can we resolve this?
>> 2. After getting these messages, the 2 Solr nodes show different document
>> counts and delete counts. It seems the 2 Solr nodes are not in sync (screenshot
>> attached for reference).
>> 3. One of the nodes (not the leader) goes into a recovering state forever.
>> 4. In the Solr update requests of 1 lakh (100,000) records, there are a few
>> thousand delete queries as well. Do the delete queries introduce more
>> slowness in syncing the nodes?
>> 
>> The above messages are coming frequently from both Solr nodes, and finally
>> one node goes into a recovering state forever.
>> 
>> Could you please help by answering the above questions?
>> 
>> Thanks,
>> Rekha
>> 
> 
> 
> -- 
> Best regards,
> Carlos Sponchiado


Re: LTR on child documents

2021-06-22 Thread Alessandro Benedetti
Hi Roopa,
can you elaborate a bit more?
Child documents are documents nonetheless, so you can query them, rank them
and re-rank them.
So, what are you trying to do?
Are you using any block join related query parsers?
Do you want to combine it with learning to rank?
How?
Let us know and we can try to help!

--
Alessandro Benedetti
Apache Lucene/Solr Committer
Director, R&D Software Engineer, Search Consultant

www.sease.io


On Mon, 21 Jun 2021 at 19:12, Roopa Rao  wrote:

> Hi,
>
> Is there a way to get the feature score on fields of the childDocuments in
> LTR?
> Note that there could be multiple childDocuments.
>
> Any example features that use queries on the child documents would be
> helpful.
>
> Thank you,
> Roopa
>


Is 2x drive size necessary for recovery on SOLR 8?

2021-06-22 Thread Stephen Lewis Bianamara
Hi SOLR Community,

I've been investigating SOLR 8 and recovery behavior. From what I can tell,
older SOLR versions (6 and before at least) required a solrdata drive with
at least 2x the space of the index; so an index which could be up to 100GB
on one shard would require a disk with at least 200GB of storage space,
since recovery would copy a brand new index over and then switch after the
fact.

However, SOLR 8 looks to have a different behavior wherein the index is
perhaps updated in place, and thus a 100GB / shard index might only need a
bit more headroom (call it 110GB say). Is this always the case with
recovery on SOLR 8+? Or are there some situations where you might need
200GB for the recovery?

Thanks in advance!
Stephen


Re: LTR on child documents

2021-06-22 Thread Roopa Rao
Hi Alessandro,

Thanks for the response.

I am using Solr version 6.6.

Basically, I have parent documents and multiple child documents for each
parent.
I am trying to create features and get feature scores on parent attributes
(which is straightforward) and on child attributes (which is what I am looking
for examples of).
I then rerank documents based on an LTR model which uses these features (on
both parent and child).

Enclosing the parent-child structure, the query construction (where you can
see I am using a child transformer), a sample feature, and the current sample
feature score output.



abc-1
feature-1281835650
parent-docs
Parent title1 text1 
Parent summary1 text2


child-doc-12345
feature-1281835650
child-docs
9735
A sample child doc desc
2021-09-01T00:00:00Z

productA
productB
productC



child-doc-56788
feature-1281835650
child-docs
3426
A sample child doc desc -
2
2021-09-02T00:00:00Z

productD
productE
productF




Query construction:
Get documents based on a search of parent document attributes - title,
summary.
Get the corresponding child documents in the response, filtered by the
products they are eligible for, restricted to 1 (for display purposes).
Rerank based on a model and get feature scores both on parent attributes and
on child attributes.

https://localhost:8983/testhandler
  ?q=trying to test
  &rows=100
  &start=0
  &q.op=AND
  &timeAllowed=2
  &fl=id,score,title_s,summary_s,[features store=testFeatureStore],[explain]
  &fl=_childDocuments_
  &fl=[child parentFilter='type:parent-docs'
        childFilter='((type:(child-docs)) AND ({!terms f=child_doc_product_s_mult}productA,productD))'
        limit=1]
  &sort=score desc
  &fq=(type:(parent-docs))
  &defType=edismax
  &df=title_qf_default
  &q.alt=*:*
  &pf=title_qf_default^2.0 title_qf_synonym^2.0
  &qf=title_qf_default^2.0 summary_qf_default^1.0
  &sow=false
  &lowercaseOperators=false
  &tie=0.0
  &rq={!ltr model=testModel reRankDocs=100 efi.uq=$q}

Here I am trying to create a feature for the title_s match with uq (the user
query), which works fine since it is on the parent.
Sample feature:
{
  "name": "feature_title_match",
  "class": "org.apache.solr.ltr.feature.SolrFeature",
  "params": {
    "q": "{!dismax qf=title_qf_default}${uq:none}"
  },
  "store": "testFeatureStore"
}

Similarly, I need to create features for the child_doc_resource_desc_s match
with uq (the user query) and a few more features on child_doc attributes.

Right now I am getting scores on the parent attributes as expected.

*However, writing features on the child document attributes and getting
feature scores for them is what I am looking for:*
Current Sample output:

originalScore=61.886406,feature_title_match=8.004536,feature_summary_match=3.340,feature_title_synonym_expansion=15.471601

Similarly, I want to build a feature, feature_child_doc_resource_desc, which
would give a feature score for the user query matched against the
child_doc_resource_desc_s attribute of the child doc.

So maybe an output like this? Or perhaps the feature scores for the child
appear under each child document - I am not sure:
originalScore=61.886406,feature_title_match=8.004536,feature_summary_match=3.340,feature_title_synonym_expansion=15.471601,feature_child_doc_resource_desc=value1
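
The closest I have come to expressing such a feature is the sketch below, but
this is only my guess: I have not confirmed that a SolrFeature accepts a
{!parent} block-join query like this on 6.6, and rolling the child scores up to
the parent with score=max is an assumption on my part:

{
  "name": "feature_child_doc_resource_desc",
  "class": "org.apache.solr.ltr.feature.SolrFeature",
  "params": {
    "q": "{!parent which='type:parent-docs' score=max}child_doc_resource_desc_s:(${uq:none})"
  },
  "store": "testFeatureStore"
}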


Thank you!
Roopa

On Tue, Jun 22, 2021 at 1:04 PM Alessandro Benedetti 
wrote:

> Hi Roopa,
> can you elaborate a bit more?
> Child documents are documents nonetheless, so you can query them, rank them
> and re-rank them.
> So, what are you trying to do?
> Are you using any block join related query parsers?
> Do you want to combine it with learning to rank?
> How?
> Let us know and we can try to help!
>
> --
> Alessandro Benedetti
> Apache Lucene/Solr Committer
> Director, R&D Software Engineer, Search Consultant
>
> www.sease.io
>
>
> On Mon, 21 Jun 2021 at 19:12, Roopa Rao  wrote:
>
> > Hi,
> >
> > Is there a way to get the feature score on fields on the childDocuments
> in
> > LTR.
> > Note that there could be multiple childDocuments.
> >
> > If there are example features that can be provided which uses query on
> the
> > child documents will be helpful
> >
> > Thank you,
> > Roopa
> >
>


Re: Is 2x drive size necessary for recovery on SOLR 8?

2021-06-22 Thread Shawn Heisey

On 6/22/2021 11:45 AM, Stephen Lewis Bianamara wrote:

However, SOLR 8 looks to have a different behavior wherein the index is
perhaps updated in place, and thus a 100GB / shard index might only need a
bit more headroom (call it 110GB say). Is this always the case with
recovery on SOLR 8+? Or are there some situations where you might need
200GB for the recovery?



The general recommendation, for normal operation and not just recovery, 
is to ensure you have enough space available so that the index can 
triple in size temporarily.  The 3x requirement only comes about with a 
very specific set of circumstances involving reindexing in-place on an 
existing index -- for MOST usage, you want enough space for the index to 
double in size temporarily. But because we cannot be sure how you are 
going to use Solr, we always err on the side of caution and tell people 
the index could triple in size before it goes back down.


Thanks,
Shawn



Re: Is 2x drive size necessary for recovery on SOLR 8?

2021-06-22 Thread Stephen Lewis Bianamara
Thanks Shawn! That is really helpful to know. Can you say more about what
circumstance might cause an index to triple in size? Is it connected with
bulk operations like "optimize" which can be avoided, or is it inherent to
situations like merging segments? And if so, can this requirement be
adjusted by an appropriate setting of maxMergedSegmentMB or something
similar?

I guess I'm wondering if there is any info or references I could look at to
determine what the limit should be for a given case even if the general
guidance is that 3x is needed.

Thanks!

On Tue, Jun 22, 2021 at 1:05 PM Shawn Heisey  wrote:

> On 6/22/2021 11:45 AM, Stephen Lewis Bianamara wrote:
> > However, SOLR 8 looks to have a different behavior wherein the index is
> > perhaps updated in place, and thus a 100GB / shard index might only need
> a
> > bit more headroom (call it 110GB say). Is this always the case with
> > recovery on SOLR 8+? Or are there some situations where you might need
> > 200GB for the recovery?
>
>
> The general recommendation, for normal operation and not just recovery,
> is to ensure you have enough space available so that the index can
> triple in size temporarily.  The 3x requirement only comes about with a
> very specific set of circumstances involving reindexing in-place on an
> existing index -- for MOST usage, you want enough space for the index to
> double in size temporarily. But because we cannot be sure how you are
> going to use Solr, we always err on the side of caution and tell people
> the index could triple in size before it goes back down.
>
> Thanks,
> Shawn
>
>


Re: Is 2x drive size necessary for recovery on SOLR 8?

2021-06-22 Thread Shawn Heisey

On 6/22/2021 2:24 PM, Stephen Lewis Bianamara wrote:

Thanks Shawn! That is really helpful to know. Can you say more about what
circumstance might cause an index to triple in size? Is it connected with
bulk operations like "optimize" which can be avoided, or is it inherent to
situations like merging segments? And if so, can this requirement be
adjusted by an appropriate setting of maxMergedSegmentMB or something
similar?



Any merge, whether it's optimize (forcemerge) or normal merging, can 
involve the entire index.


Let's say you have an index that has a number of very large segments.  
Either you optimized it at some point or it's just been running for a 
long time and has reached that state naturally.


You begin a reindexing process.  This process hits almost all the 
documents in the index, but a few are left untouched.


Those few untouched documents mean that the segments containing them 
must stick around, even though they're comprised almost entirely of 
deleted documents.


At this point, without even doing an optimize, the index has doubled in 
size -- the original segments are still there because they contain a few 
not-deleted docs, and all the new data is in new segments.  In practice, 
some of those older segments probably got merged and shrank, but we're 
discussing worst-case scenarios here, so pretend for a moment that they 
have not been merged away.


Then either you do some more indexing that results in a super-large 
merge, or run an optimize.  At this point, with the index already 
doubled in size, that further merging could add the whole index again 
before it deletes the older segments and you're back to 1x.


Realistically, you probably need enough space for the index to reach 
2.5x when doing in-place reindexing, but if the planets all align just 
right, you could need 3x.  If you never reindex the whole thing in place 
(without either creating a new index or deleting the existing one) then 
you would only need 2x.  But because sometimes the planets do align just 
right, I tell people to have 3x just in case.


Thanks
Shawn



Re: Is 2x drive size necessary for recovery on SOLR 8?

2021-06-22 Thread Dave
The 3x index size guidance has been around for a long time. Usually it’s for a
full optimize.  When this happens the original index stays in place (1x) while
it is being reconstructed (2x) and then merged into the replacement (3x); once
it’s all done you are back to less than 1x, but you need the space or the
optimize will fail.  The new rule is that you never optimize, but you will
always want that extra space just in case - and disks are cheap.

> On Jun 22, 2021, at 4:24 PM, Stephen Lewis Bianamara 
>  wrote:
> 
> Thanks Shawn! That is really helpful to know. Can you say more about what
> circumstance might cause an index to triple in size? Is it connected with
> bulk operations like "optimize" which can be avoided, or is it inherent to
> situations like merging segments? And if so, can this requirement be
> adjusted by an appropriate setting of maxMergedSegmentMB or something
> similar?
> 
> I guess I'm wondering if there is any info or references I could look at to
> determine what the limit should be for a given case even if the general
> guidance is that 3x is needed.
> 
> Thanks!
> 
>> On Tue, Jun 22, 2021 at 1:05 PM Shawn Heisey  wrote:
>> 
>>> On 6/22/2021 11:45 AM, Stephen Lewis Bianamara wrote:
>>> However, SOLR 8 looks to have a different behavior wherein the index is
>>> perhaps updated in place, and thus a 100GB / shard index might only need
>> a
>>> bit more headroom (call it 110GB say). Is this always the case with
>>> recovery on SOLR 8+? Or are there some situations where you might need
>>> 200GB for the recovery?
>> 
>> 
>> The general recommendation, for normal operation and not just recovery,
>> is to ensure you have enough space available so that the index can
>> triple in size temporarily.  The 3x requirement only comes about with a
>> very specific set of circumstances involving reindexing in-place on an
>> existing index -- for MOST usage, you want enough space for the index to
>> double in size temporarily. But because we cannot be sure how you are
>> going to use Solr, we always err on the side of caution and tell people
>> the index could triple in size before it goes back down.
>> 
>> Thanks,
>> Shawn
>> 
>> 


Re: Excessive query expansion when using WordDelimiterGraphFilter

2021-06-22 Thread Francis Crimmins
Hi Michael:

Thanks for your detailed response, which is much appreciated. We’ll take a look 
at some of your suggestions.

We know that if we drop back to using the old WordDelimiterFilter we do not see 
this behaviour, but we would prefer not to use a deprecated class if possible.

It would be great if there were some parameter or configuration that would allow 
some kind of limit or threshold to be set to avoid this behaviour. We’re 
concerned that specifying a low “maxBooleanClauses” setting may adversely 
affect other parts of the query processing (although we’re not sure whether 
that would be the case).

Cheers,

Francis.

--
Francis Crimmins | Senior Software Engineer | National Library of Australia
M: +61 0433 545 884 | E: fcrimm...@nla.gov.au | nla.gov.au

The National Library of Australia (NLA) acknowledges Australia’s First Nations 
Peoples – the First Australians – as the Traditional Owners and Custodians of 
this land and gives respect to the Elders – past and present – and through them 
to all Australian Aboriginal and Torres Strait Islander people.


From: Michael Gibney 
Date: Tuesday, 22 June 2021 at 12:50 pm
To: users@solr.apache.org 
Subject: Re: Excessive query expansion when using WordDelimiterGraphFilter
Hi Francis,

I have indeed encountered this problem -- though as you've discovered
it's dependent on specific types of analysis chain config.

You should be able to avoid this behavior by constructing your
analysis chains such that they don't in practice create graph
TokenStreams. This advice generally applies more strongly at
index-time, because the graph structure of the TokenStream is
discarded at index-time. But in a case like yours (a relatively large
index, and queries that have proven to be problematic) I think your
intuition is correct that you could both:
1. lower the maxBooleanClauses limit as a failsafe (see the snippet after this
list), and also
2. mitigate negative functional consequences by modifying your
analysis chains to reduce or eliminate this exponential expansion in
the first place
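
(For reference, and only as a sketch with an example value: maxBooleanClauses
is set in solrconfig.xml under the <query> section, e.g.

  <query>
    <maxBooleanClauses>1024</maxBooleanClauses>
  </query>

and, if I remember correctly, recent 8.x releases also apply a global cap that
can be raised via the solr.max.booleanClauses property referenced in solr.xml.)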

My recommendation would be to index into two separate fulltext fields:
one with wdgf configured to only split (i.e.,
"generate*Parts"/"splitOn", etc.), and one with wdgf configured to only
concatenate (i.e., "preserveOriginal", "catenate*"). This should
prevent graph token streams from being created, and should thus
prevent the kind of exponential expansion you're seeing here.
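
Very roughly, and only as a sketch (the type and field names here are made up,
and the rest of each analysis chain - lowercasing, stopwords, an index-time
flatten filter, etc. - should mirror whatever you already have):

  <fieldType name="text_wdgf_split" class="solr.TextField" positionIncrementGap="100">
    <analyzer>
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <!-- split-only: generate parts, never catenate or preserve the original -->
      <filter class="solr.WordDelimiterGraphFilterFactory"
              generateWordParts="1" generateNumberParts="1"
              catenateWords="0" catenateNumbers="0" catenateAll="0"
              preserveOriginal="0"/>
    </analyzer>
  </fieldType>

  <fieldType name="text_wdgf_catenate" class="solr.TextField" positionIncrementGap="100">
    <analyzer>
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <!-- catenate-only: keep the original and the joined form, never split -->
      <filter class="solr.WordDelimiterGraphFilterFactory"
              generateWordParts="0" generateNumberParts="0"
              catenateWords="1" catenateNumbers="1" catenateAll="0"
              preserveOriginal="1"/>
    </analyzer>
  </fieldType>

  <field name="fulltext_split" type="text_wdgf_split" indexed="true" stored="false"/>
  <field name="fulltext_catenate" type="text_wdgf_catenate" indexed="true" stored="false"/>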

If you really want to have your cake and eat it too (to the extent
that's possible at the moment), you could use two fields configured as
mentioned above for _phrase_ searching (i.e.,
`pf=fulltext_split,fulltext_catenate`), and have a third field with
wdgf configured to split _and_ catenate (which I infer is the current
configuration?) and use that as the query field (i.e.,
`qf=fulltext_split_and_catenate`). The problem is really the implicit
phrase queries (`pf`) in this case; so in fact you might get passable
results by simply disabling `pf` altogether (setting it to empty) --
though that would be a very blunt instrument, so I wouldn't recommend
disabling `pf` unless absolutely necessary.
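
Concretely (using the made-up field names above), the edismax parameters might
end up looking something like this; qf/pf take whitespace-separated lists of
field(^boost) entries:

  defType=edismax
  qf=fulltext_split_and_catenate
  pf=fulltext_split fulltext_catenate

where fulltext_split_and_catenate would be the third field carrying your
current split-and-catenate wdgf config.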

The bigger picture here is also interesting (to me, at least!): there
were some initial steps towards more general application of true
"positional graph queries" (e.g., LUCENE-7638 [1]). But due to
problems fundamental to positional graph queries (well described in
LUCENE-7398 [2], though not unique to span queries), much of this
functionality has been effectively removed from the most common query
parsers (e.g., LUCENE-8477 [3], LUCENE-9207 [4]). The timing of your
question resonates with me, as I've recently been working to add
benchmarks that illustrate the performance impact of this "exponential
expansion" behavior (see the comments tacked onto the end of
LUCENE-9204 [5]).

Michael

[1] https://issues.apache.org/jira/browse/LUCENE-7638
[2] https://issues.apache.org/jira/browse/LUCENE-7398
[3] https://issues.apache.org/jira/browse/LUCENE-8477
[4] https://issues.apache.org/jira/browse/LUCENE-9207
[5] https://issues.apache.org/jira/browse/LUCENE-9204

ps- Some further relevant issues/blog posts:
https://issues.apache.org/jira/browse/LUCENE-4312

https://opensourceconnections.com/blog/2018/02/20/edismax-and-multiterm-synonyms-oddities/
https://lucidworks.com/post/multi-word-synonyms-solr-adds-query-time-support/
http://blog.mikemccandless.com/2012/04/lucenes-tokenstreams-are-actually.html
https://www.elastic.co/blog/multitoken-synonyms-and-graph-queries-in-elasticsearch
https://michaelgibney.net/lucene/graph/


On Mon, Jun 21, 2021 at 8:50 PM Francis Crimmins  wrote:
>
> Hi:
>
> We are currently upgrading our Solr instance from 5.1.0. to 8.7.0, with an 
> index of over 250 million documents. This powers the search at:
>
> https://trove.nla.gov.au/
>
> We are using the WordDelimiterGraphFilter in the filter chain for queries:
>
> 
> https://solr.apache.org/guide/8_7/filter-descriptions.html#word-delimiter-graph-filter
>
> and have found that

Re: Is 2x drive size necessary for recovery on SOLR 8?

2021-06-22 Thread Stephen Lewis Bianamara
In my experience, disks are not always cheap :) Running in AWS I have found
several contexts which require local storage for cost effective performance
of SOLR, but that does require scaling the instance as a whole to increase
capacity (hence the particular motivation for this question).

Generally the use cases I am considering don't re-index the whole index
in place; rather, I have used an A/B strategy to stand up a parallel
cluster, index to that, and then cut over using some other method (aliases or
DNS draining, depending on the context). So as far as a re-indexing
operation is concerned, this seems controllable by favoring certain
methodologies.

The merging considerations are certainly interesting and nuanced. Has there
been any investigation into a "minimum number of segments" setting which
could force a minimum number of segments (say 5 or 10) so that no single
segment operation could involve the entire index?

On Tue, Jun 22, 2021 at 1:37 PM Dave  wrote:

> The 3x index size has been around for a long time. Usually it’s for a full
> optimize.  When this happens the original index stays in place, 1x, and is
> being reconstructed, 2x, then merged into the replacement 3x, once it’s all
> done you are back to less than 1x but you need the space or the optimize
> will fail.  The new rules are that you never optimize but you will always
> want that extra space just in case, and disks are cheap,
>
> > On Jun 22, 2021, at 4:24 PM, Stephen Lewis Bianamara <
> stephen.bianam...@gmail.com> wrote:
> >
> > Thanks Shawn! That is really helpful to know. Can you say more about
> what
> > circumstance might cause an index to triple in size? Is it connected with
> > bulk operations like "optimize" which can be avoided, or is it inherent
> to
> > situations like merging segments? And if so, can this requirement be
> > adjusted by an appropriate setting of maxMergedSegmentMB or something
> > similar?
> >
> > I guess I'm wondering if there is any info or references I could look at
> to
> > determine what the limit should be for a given case even if the general
> > guidance is that 3x is needed.
> >
> > Thanks!
> >
> >> On Tue, Jun 22, 2021 at 1:05 PM Shawn Heisey 
> wrote:
> >>
> >>> On 6/22/2021 11:45 AM, Stephen Lewis Bianamara wrote:
> >>> However, SOLR 8 looks to have a different behavior wherein the index is
> >>> perhaps updated in place, and thus a 100GB / shard index might only
> need
> >> a
> >>> bit more headroom (call it 110GB say). Is this always the case with
> >>> recovery on SOLR 8+? Or are there some situations where you might need
> >>> 200GB for the recovery?
> >>
> >>
> >> The general recommendation, for normal operation and not just recovery,
> >> is to ensure you have enough space available so that the index can
> >> triple in size temporarily.  The 3x requirement only comes about with a
> >> very specific set of circumstances involving reindexing in-place on an
> >> existing index -- for MOST usage, you want enough space for the index to
> >> double in size temporarily. But because we cannot be sure how you are
> >> going to use Solr, we always err on the side of caution and tell people
> >> the index could triple in size before it goes back down.
> >>
> >> Thanks,
> >> Shawn
> >>
> >>
>


Re: Is 2x drive size necessary for recovery on SOLR 8?

2021-06-22 Thread Shawn Heisey

On 6/22/2021 5:02 PM, Stephen Lewis Bianamara wrote:

The merging considerations are certainly interesting and nuanced. Has there
been any investigation into a "minimum number of segments" setting which
could force a minimum number of segments (say 5 or 10) so that no single
segment operation could involve the entire index?



Since Solr 7.5, the merge defaults are a lot better.  I think the "no 
segment larger than 5GB" setting even applies to optimize, but I'm not 
completely positive.  Erick Erickson is familiar with the nitty gritty 
details on that.
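
If you did want to pin that behavior down explicitly rather than rely on the
defaults, the knob lives on the merge policy in solrconfig.xml.  Something
like the sketch below (the values are only examples, and I'd double-check the
exact parameter names and types against the ref guide for your version):

  <indexConfig>
    <mergePolicyFactory class="org.apache.solr.index.TieredMergePolicyFactory">
      <double name="maxMergedSegmentMB">5000</double>
      <int name="segmentsPerTier">10</int>
      <int name="maxMergeAtOnce">10</int>
    </mergePolicyFactory>
  </indexConfig>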


If you never run an optimize, you should be fine.  And since you go with 
the A/B option for reindexing, you might not ever run into the 3x 
requirement.  But if you're dealing with bare metal servers, disks are 
cheap, so it's a good idea to have LOTS of free space.  If you're going 
AWS or some other cloud solution, you'll probably want to be more aware 
of realistic requirements.


Thanks,
Shawn



Re: Is 2x drive size necessary for recovery on SOLR 8?

2021-06-22 Thread Stephen Lewis Bianamara
Thanks Shawn! I will definitely be interested to explore this space
cautiously in that case.

This was a ton of great info I needed, and more than I initially knew to
ask :) Your first response seems to imply an answer to my original
question, but I wanted to follow up to be as sure as I can. In the recovery
scenario, are there situations where a complete index will be copied over next
to the original index, thus requiring 2x the disk space? Or is that now
outdated? I could imagine, for example, that the replacement is now done on each
segment at a smaller scale or something along those lines, so recovery
requirements would be expected to be on par with merge requirements; or perhaps
there is some "bad enough" scenario where a full side-by-side copy is made
during recovery. Can you comment on that?

Thanks!

On Tue, Jun 22, 2021 at 5:33 PM Shawn Heisey  wrote:

> On 6/22/2021 5:02 PM, Stephen Lewis Bianamara wrote:
> > The merging considerations are certainly interesting and naunced. Has
> there
> > been any investigation into a "minimum number of segments" setting which
> > could force a minimum number of segments (say 5 or 10) so that no one
> > segment operation could involve the entire index?
>
>
> Since Solr 7.5, the merge defaults are a lot better.  I think the "no
> segment larger than 5GB" setting even applies to optimize, but I'm not
> completely positive.  Erick Erickson is familiar with the nitty gritty
> details on that.
>
> If you never run an optimize, you should be fine.  And since you go with
> the A/B option for reindexing, you might not ever run into the 3x
> requirement.  But if you're dealing with bare metal servers, disks are
> cheap, so it's a good idea to have LOTS of free space.  If you're going
> AWS or some other cloud solution, you'll probably want to be more aware
> of realistic requirements.
>
> Thanks,
> Shawn
>
>