Re: Crawling Italian language site in Solr

2023-07-28 Thread Markus Jelsma
Hello Fiz,

This normally happens when websites are capable of responding with
translations of their content. Usually this is controlled by the client's
Accept-Language header, and in worse cases it is decided based on the
client's apparent IP address.
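
One way to check whether a site varies its content on this header is to request the same page with different Accept-Language values and compare the responses. A minimal Python sketch using only the standard library; the helper name build_request and the example.com URL are hypothetical placeholders, not anything from the original report:

```python
from urllib.request import Request


def build_request(url: str, language: str) -> Request:
    """Return a request that asks the server for content in `language`."""
    return Request(url, headers={"Accept-Language": language})


# Hypothetical URL; replace it with a page from the site being crawled.
italian = build_request("https://example.com/", "it-IT,it;q=0.9")
english = build_request("https://example.com/", "en-US,en;q=0.9")

# urllib stores header names capitalized as "Accept-language", so that is
# the spelling get_header() expects.
print(italian.get_header("Accept-language"))
print(english.get_header("Accept-language"))

# Passing each request to urllib.request.urlopen() and comparing the bodies
# would show whether the site switches language based on this header.
```

On the crawler side, Nutch has, as far as I know, an http.accept.language property that sets this header for fetches; treat the exact name as something to verify against your own nutch-default.xml.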

In Nutch you can test its output by using the bin/nutch indexchecker
command. This is the output that is sent to search engines such as Solr. So
if the language in Solr is suddenly different from what you expect, then
your problem lies in what Nutch receives and sends. Hence, your problem
lies in the web crawler domain, not in Solr.

Regards,
Markus

PS: attached files usually don't work on the mailing list.

On Fri, 28 Jul 2023 at 08:08, Fiz N wrote:

> Hi SOLR Experts,
>
> In an Azure VM (Linux), we have installed Solr version 8.11.2 and the Nutch
> crawler (apache-nutch-1.19). To crawl the site in Italian, we added the
> tokenizer. *In the Solr admin screen we can see the documents, but in
> English.*
>
> Please see the attached managed-schema changes.
>
>
>
> Regards
>
> Fiz A.
>
>


JSON boolean query syntax with edismax as default QueryParser

2023-07-28 Thread Jane Sandberg
Hi Solr colleagues,

On Solr 8.4.1, we’ve noticed that the following types of JSON DSL queries work 
if our luceneMatchVersion is 7.1 or lower, or if our default query parser is 
set to lucene:

{"query":{"bool":{"must":[{"lucene":{"query":"plasticity","df":"title_a_index"}}]}}}

However, if the query parser is set to edismax and the luceneMatchVersion is 
7.2 or higher, the parsed query visible with debug=true becomes a complete 
mess, searching for the terms “bool” and “must”, rather than the terms we 
actually want to search for:


+(DisjunctionMaxQuery(((author_main_unstem_search:bool)^1000.0 | 
(local_subject_unstem_search:bool)^15.0 | (author_unstem_search:bool)^40.0 […]

Also with debug=true, we noticed that the JSON DSL queries get converted into
a query string with local params: “{!bool must=$_tt1 }”. So I suspect
these two changes in Solr 7.2 are the reason we can’t use Boolean JSON queries
with edismax and a recent luceneMatchVersion:
https://solr.apache.org/docs/7_2_0/changes/Changes.html#v7.2.0.upgrade_notes
Does that seem correct?
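
For anyone trying to reproduce this, the request body from the example can be built programmatically. A minimal Python sketch, assuming only the standard library; bool_must_query is a hypothetical helper name, and the term and field are the ones from the example query:

```python
import json


def bool_must_query(term: str, field: str) -> str:
    """Build a JSON DSL body with one `lucene` clause under bool/must."""
    body = {
        "query": {
            "bool": {
                "must": [
                    {"lucene": {"query": term, "df": field}}
                ]
            }
        }
    }
    # Compact separators reproduce the single-line form used in the example.
    return json.dumps(body, separators=(",", ":"))


print(bool_must_query("plasticity", "title_a_index"))
```

Sending the same body with debug=true under each parser setting should make the difference in the parsed query easy to compare side by side.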

Also, could this be related to the question Benjamin Armintor asked on June 23 
(subject: Changes to JSON query API/syntax in Solr 9.x?)?  I’m specifically 
curious about whether a luceneMatchVersion of 7.1 or lower still works in Solr 
9?

Thanks for your insights,

  -Jane

--
Jane Sandberg (she/her)
Library Software Engineer, Discovery and Access Services


Re: Slow softCommits under heavy load?

2023-07-28 Thread Shawn Heisey

On 7/23/23 05:24, Koen De Groote wrote:

> After having a look at these files: No, I cannot share them.
>
> What I can say is that there's a couple hundred fields, dynamicFields and
> copyFields (each).
>
> The update handler uses solr.DirectUpdateHandler2 (the only one I can see in
> the source code extending the regular updateHandler), with a max autoCommit
> time of 6 and a max autoSoftCommit time of 1000.


You can cause yourself no end of problems with that super short 
autoSoftCommit.  It can lead to lots of commits happening at the same time.


What I would start with is reducing the autoCommit interval, to 3 or 
15000, and greatly increasing the autoSoftCommit interval, to at least 
3, maybe even as high as 12.


Stop sending explicit commits.  You especially don't want to do a commit 
after every document ... that has the potential to be even worse than 
one autoSoftCommit per second.


If possible, you should also be indexing a lot more than one document 
per indexing request.
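
A rough illustration of the batching idea in Python, standard library only. The helper name batches, the batch size of 3, and the sample documents are made up for the example; how each batch is actually sent depends on the client you use:

```python
from typing import Iterable, Iterator, List


def batches(docs: Iterable[dict], size: int) -> Iterator[List[dict]]:
    """Yield lists of up to `size` documents, preserving their order."""
    if size < 1:
        raise ValueError("size must be at least 1")
    batch: List[dict] = []
    for doc in docs:
        batch.append(doc)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        # Emit the final, possibly shorter, batch.
        yield batch


docs = [{"id": str(i)} for i in range(7)]  # made-up sample documents
for chunk in batches(docs, 3):
    # Each chunk would go out as ONE update request, without any commit
    # parameter, leaving commits to autoCommit and autoSoftCommit.
    print(len(chunk))
```

In practice the batch size would be much larger than 3; the point is only that one request carries many documents and none of the requests asks for a commit.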


Thanks,
Shawn


Re: [EXTERNAL] Re: upgrade to 8.6 to 9.2

2023-07-28 Thread Shawn Heisey

On 7/21/23 09:03, Oakley, Craig (NIH/NLM/NCBI) [C] wrote:

> One thing that comes to mind is to have this in your start.sh script:
>
> export SOLR_JETTY_HOST="0.0.0.0"


This is a good point.  For security reasons, Solr 9 listens only on
localhost by default, much like other software such as MySQL.  If you want
to access Solr from outside the server itself, you have to define the
SOLR_JETTY_HOST environment variable as Craig mentions.


Thanks,
Shawn


Re: Add a new Shard to the collection

2023-07-28 Thread Mikhail Khludnev
Hello Hari.
If the new shards are handling queries and updates well, it's OK for the old
shard to be inactive.
You can issue a DELETESHARD request to reclaim the disk space.
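
As a sketch, the DELETESHARD request URL can be built the same way as the SPLITSHARD one. The Python below uses only the standard library; deleteshard_url is a hypothetical helper name, and the host, collection, and shard names are taken from the SPLITSHARD request quoted in this thread, so adjust them to your environment:

```python
from urllib.parse import urlencode


def deleteshard_url(base_url: str, collection: str, shard: str) -> str:
    """Build a Collections API URL that deletes one inactive shard."""
    params = {
        "action": "DELETESHARD",
        "collection": collection,
        "shard": shard,
    }
    return f"{base_url}/solr/admin/collections?{urlencode(params)}"


# Values copied from the quoted SPLITSHARD request; verify before use.
print(deleteshard_url("https://solrserver.corp.company.com:8981",
                      "abcStore", "shard1"))
```

Only run this once you have confirmed that the two sub-shards are active and serving the full document count.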

On Mon, Jul 24, 2023 at 6:19 PM HariBabu kuruva 
wrote:

> Hi All,
>
> I would like to add a new shard to the existing collection to have better
> performance.  Currently we have one shard.
>
> Solr - 8.11.1
> Nodes(servers) - 10 (Non prod - 4 nodes)
> Zookeepers-5
>
> I have tried the SPLITSHARD command in one of the non prod environments.
>
> https://solrserver.corp.company.com:8981/solr/admin/collections?action=SPLITSHARD&collection=abcStore&shard=shard1
>
> Now I can see a total of 3 shards:
> Shard1
> Shard1_0
> Shard1_1
>
> But Shard1 is shown as inactive. Please let me know if we need to remove
> it.
>
> Please let me know whether this is the correct way of splitting the shard.
> Does this have any impact on the data?
> What precautions should be taken when doing this in a PROD environment?
>
> --
>
> Thanks and Regards,
>  Hari
> Mobile:9790756568
>


-- 
Sincerely yours
Mikhail Khludnev