[ANNOUNCE] Apache Solr 8.11.0 released

2021-11-17 Thread Adrien Grand
The Solr PMC is pleased to announce the release of Apache Solr 8.11.0.

Solr is the popular, blazing fast, open source NoSQL search platform from
the Apache Solr project. Its major features include powerful full-text
search, hit highlighting, faceted search and analytics, rich document
parsing, geospatial search, extensive REST APIs as well as parallel SQL.
Solr is enterprise grade, secure and highly scalable, providing fault
tolerant distributed search and indexing, and powers the search and
navigation features of many of the world's largest internet sites.

The release is available for immediate download at:
  https://solr.apache.org/downloads.html

Please read CHANGES.txt for a detailed list of changes:
  https://solr.apache.org/docs/8_11_0/changes/Changes.html

Solr 8.11.0 Release Highlights
 * Security
   - New MultiAuthPlugin (for authentication) and
MultiAuthRuleBasedAuthorizationPlugin (for authorization) classes support
multiple authentication schemes, such as Bearer and Basic. This allows the
Admin UI to use OIDC (JWTAuthPlugin) to authenticate users while still
supporting Basic authentication for command-line tools and the Prometheus
exporter.
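For reference, a multi-scheme security.json under MultiAuthPlugin looks
roughly like the sketch below. This is a hedged illustration rather than an
excerpt from the release notes: the wellKnownUrl and credentials values are
placeholders, and the exact per-scheme options should be checked against the
Solr Reference Guide.

```json
{
  "authentication": {
    "class": "solr.MultiAuthPlugin",
    "schemes": [
      {
        "scheme": "bearer",
        "class": "solr.JWTAuthPlugin",
        "wellKnownUrl": "https://idp.example.com/.well-known/openid-configuration"
      },
      {
        "scheme": "basic",
        "class": "solr.BasicAuthPlugin",
        "blockUnknown": false,
        "credentials": { "admin": "<hashed-password-and-salt>" }
      }
    ]
  }
}
```

The scheme name is matched against the incoming Authorization header
("Bearer ..." vs "Basic ..."), which is how browser OIDC logins and
Basic-auth scripts can coexist on the same cluster.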

A summary of important changes is published in the Solr Reference Guide at
https://solr.apache.org/guide/8_11/solr-upgrade-notes.html. For the most
exhaustive list, see the full release notes at
https://solr.apache.org/docs/8_11_0/changes/Changes.html or by viewing the
CHANGES.txt file accompanying the distribution.  Solr's release notes
usually don't include Lucene layer changes.  Lucene's release notes are at
https://lucene.apache.org/core/8_11_0/changes/Changes.html

-- 
Adrien


Alternate to CDCR

2021-11-17 Thread Vaddi, Seshasai
Hi Solr Team,

I'm looking for alternatives to Solr CDCR, as it is being deprecated.
Could you suggest the best alternatives for achieving disaster recovery?

Sent from Mail for Windows



Re: OpenNLP dictionary-based lemmatizer memory issue

2021-11-17 Thread Spyros Kapnissis
Thank you Alessandro for your comments and getting back so quickly, that
sounds great!

On Tue, Nov 16, 2021 at 7:35 PM Alessandro Benedetti 
wrote:

> Hi,
> I've done an initial review and it looks ok to me!
> Before committing I added a couple of other committers to the loop, let's
> see if they have any insight and in a couple of weeks we merge!
> Cheers
> --
> Alessandro Benedetti
> Apache Lucene/Solr Committer
> Director, R&D Software Engineer, Search Consultant
>
> www.sease.io
>
>
> On Mon, 15 Nov 2021 at 19:53, Spyros Kapnissis  wrote:
>
> > Hi all,
> >
> > We recently identified and fixed an issue with the OpenNLP
> > dictionary-based lemmatizer that seems to affect all versions. It
> > resulted in high memory usage, random OOM exceptions, high server load
> > during both indexing and querying, and overall unstable performance.
> >
> > It turns out that the issue was related to the way the dictionary is
> > cached internally in Solr/Lucene: instead of caching the generated
> > dictionary hashmap, the raw string contents were cached. This meant the
> > dictionary hashmap had to be re-generated in memory whenever
> > TokenFilterFactory.create() was called. In our case, the dictionary was
> > pretty large, so the effects were magnified.
> >
> > I have submitted a patch on Lucene, but also posting here for visibility
> > and in case someone can help review & merge. The patch is available here:
> > https://github.com/apache/lucene/pull/380, and this is the corresponding
> > ticket:
> https://issues.apache.org/jira/projects/LUCENE/issues/LUCENE-10171
> >
> > Please let me know if you need any more details.
> >
> > Thanks!
> > Spyros
> >
>
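The caching pitfall described in the quoted message can be sketched in a few
lines. This is a hypothetical illustration of the general pattern, not the
actual Lucene code: caching the dictionary's raw text still forces a full
re-parse on every factory call, whereas caching the parsed table builds it
once.

```python
# Hypothetical illustration (not the actual Lucene code) of the bug: caching
# the raw dictionary *text* forces a full re-parse on every
# TokenFilterFactory.create()-style call, while caching the parsed mapping
# builds it only once.

RAW_DICTIONARY = "\n".join(f"word{i}\tlemma{i}\tNOUN" for i in range(10_000))

_cached_text = RAW_DICTIONARY   # anti-pattern: cache the string contents
_cached_table = None            # fix: cache the generated hashmap


def parse_dictionary(raw: str) -> dict:
    """Build a (word, pos) -> lemma lookup table from tab-separated lines."""
    table = {}
    for line in raw.splitlines():
        word, lemma, pos = line.split("\t")
        table[(word, pos)] = lemma
    return table


def create_filter_buggy() -> dict:
    # Re-generates the whole table in memory on every call.
    return parse_dictionary(_cached_text)


def create_filter_fixed() -> dict:
    # Parses once and reuses the same object thereafter.
    global _cached_table
    if _cached_table is None:
        _cached_table = parse_dictionary(RAW_DICTIONARY)
    return _cached_table


# The fixed factory hands back the same table; the buggy one allocates anew.
assert create_filter_fixed() is create_filter_fixed()
assert create_filter_buggy() is not create_filter_buggy()
```

With a large dictionary and frequent factory calls (as during heavy indexing
and querying), the buggy variant's repeated allocations explain the memory
pressure and GC load described above.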


NRT Searching (getting it working)

2021-11-17 Thread Derek C
Hi all,

I'm trying to get Near Real Time searching working with SOLR (so that
documents I insert, or documents I update, are visible in a SOLR query as
quickly as possible).

I've tried configuring autoCommit and autoSoftCommit in solrconfig.xml but
it's taking about 10 minutes to see the updates when doing a select on the
core.

I have about 2.2 million documents in a SOLR core (quite a lot of fields
too - maybe 40 and a lot are indexed=true as well).  I'm using
ClassicIndexSchemaFactory rather than ManagedIndexSchemaFactory.

Right now I'm running on a single VM with 16 GB of memory and 12 GB given to
Solr (displayed as JVM-Memory on the Dashboard page; right now it's saying
74.3% / 8.92GB of 12.00GB in use).

My current autoCommit and autoSoftCommit settings from the core's
solrconfig.xml are below.  I was hoping that the autoSoftCommit maxTime of
1000 (ms) would mean my changes to a document become visible in a query
pretty quickly, but it's taking a long time (10 mins).

I've looked at the docs on
https://solr.apache.org/guide/8_10/near-real-time-searching.html but I
haven't (yet) worked out what else needs to be setup or configured.

Any help/info really appreciated

thanks, Derek

From my solrconfig.xml:

<autoCommit>
  <maxDocs>100000000</maxDocs>
  <maxTime>86400000</maxTime>
  <openSearcher>false</openSearcher>
</autoCommit>

<autoSoftCommit>
  <maxTime>1000</maxTime>
</autoSoftCommit>
-- 
-- 
Derek Conniffe
Harvey Software Systems Ltd T/A HSSL
Telephone (IRL): 086 856 3823
Telephone (US): (650) 443 8285
Skype: dconnrt
Email: de...@hssl.ie


*Disclaimer:* This email and any files transmitted with it are confidential
and intended solely for the use of the individual or entity to whom they
are addressed. If you have received this email in error please delete it
(if you are not the intended recipient you are notified that disclosing,
copying, distributing or taking any action in reliance on the contents of
this information is strictly prohibited).
*Warning*: Although HSSL have taken reasonable precautions to ensure no
viruses are present in this email, HSSL cannot accept responsibility for
any loss or damage arising from the use of this email or attachments.
For the Environment, please only print this email if necessary.


Re: NRT Searching (getting it working)

2021-11-17 Thread Shawn Heisey

On 11/17/21 7:05 AM, Derek C wrote:

Hi all,

I'm trying to get Near Real Time searching working with SOLR (so that
documents I insert, or documents I update, are visible in a SOLR query as
quickly as possible).



I have about 2.2 million documents in a SOLR core (quite a lot of fields
too - maybe 40 and a lot are indexed=true as well).  I'm using
ClassicIndexSchemaFactory rather than ManagedIndexSchemaFactory.

Right now I'm running on a single VM with 16Gbytes of memory and 12GB given
to SOLR (displayed as JVM-Memory on the Dashboard page - right now it's
saying 74.3% / 8.92GB of 12.00GB in use).





<autoCommit>
  <maxDocs>100000000</maxDocs>
  <maxTime>86400000</maxTime>
  <openSearcher>false</openSearcher>
</autoCommit>

<autoSoftCommit>
  <maxTime>1000</maxTime>
</autoSoftCommit>



The first thing I would change is autoCommit.  Go with something like this:

<autoCommit>
  <maxTime>60000</maxTime>
  <openSearcher>false</openSearcher>
</autoCommit>

A value of 24 hours or 100 million documents might as well not be 
configured at all.


One second is FAR too aggressive for autoSoftCommit.  Unless your index 
is super tiny, which is not a description I would apply to 2.2 million 
documents, a timeframe that low will tend to CAUSE problems.  For it to 
take 10 minutes is extremely odd, and probably indicates that there is a 
very large performance problem with your setup.


You did not indicate what version of Solr you have, or how large that 
index is on disk.


Can you gather a screenshot from the server, put it on a file sharing 
site, and provide a URL for it?  Sending it as an email attachment is 
unlikely to succeed.  This wiki page describes what I am looking for:


https://cwiki.apache.org/confluence/display/SOLR/SolrPerformanceProblems#SolrPerformanceProblems-Askingforhelponamemory/performanceissue

It would also be useful for us to have solr.log and solr_gc.log covering 
the time period from when you index a change to when the document 
becomes visible.  The whole unedited file, not an excerpt.


Dropbox and gist are two good choices for sharing files.  There are many 
others.


The fact that you have Solr in a VM means that if there are any 
performance issues relating to the VM host, they could be translating 
into problems for Solr.  The possibilities for problems at the VM host 
level are numerous.


Thanks,
Shawn




Re: NRT Searching (getting it working)

2021-11-17 Thread Derek C
Hi Shawn,

Thanks for the help.

Now that you mention it: just after I sent the email I started looking at
top and saw 100%, 200%, 300% CPU usage (and the VM only has 4 cores, so I
was looking at maxed-out cores).  I've grabbed screenshots now and I'm not
seeing those high numbers, but I've also grabbed a screenshot of the VM CPU
usage (it's a VM in Proxmox).

I'm not sure how to understand the solr_gc.log file (but I'd like to)

Latest SOLR gc log file:
https://storage.snaplocal.co.uk/solr_debug/solr_gc_log.txt

Top screen shot # 1:
https://storage.snaplocal.co.uk/solr_debug/Top-Screenshot%202021-11-17%20at%2015.00.45.png
Top screen shot # 2:
https://storage.snaplocal.co.uk/solr_debug/Top-Screenshot%202021-11-17%20at%2015.00.55.png
Top screen shot # 3:
https://storage.snaplocal.co.uk/solr_debug/Top-Screenshot%202021-11-17%20at%2015.01.19.png
Top screen shot # 4:
https://storage.snaplocal.co.uk/solr_debug/Top-Screenshot%202021-11-17%20at%2015.01.35.png
CPU Usage [max 1 hour] Proxmox:
https://storage.snaplocal.co.uk/solr_debug/Top-Screenshot%202021-11-17%20at%2015.02.18.png
CPU Usage [max 1 day] Proxmox:
https://storage.snaplocal.co.uk/solr_debug/Top-Screenshot%202021-11-17%20at%2015.02.25.png

Derek

On Wed, Nov 17, 2021 at 2:44 PM Shawn Heisey  wrote:

> [quoted message trimmed]

-- 
-- 
Derek Conniffe
Harvey Software Systems Ltd T/A HSSL
Telephone (IRL): 086 856 3823
Telephone (US): (650) 443 8285
Skype: dconnrt
Email: de...@hssl.ie




[Operator] [ANNOUNCE] Apache Solr Operator v0.5.0 released

2021-11-17 Thread Houston Putman
The Apache Solr PMC is pleased to announce the release of the Apache Solr
Operator v0.5.0.

The Apache Solr Operator is a safe and easy way of managing a Solr
ecosystem in Kubernetes.

This release contains numerous bug fixes, optimizations, and improvements,
some of which are highlighted below. The release is available for immediate
download at:

  

*Solr Operator v0.5.0 Release Highlights:*


   - Support for Kubernetes v1.22+ (including the new Ingress APIs)
   - Support for cloud-native backups, and multiple backup repositories
   per-SolrCloud
  - GCS and S3 Backup Repositories are now fully supported (require
  Solr 8.9 and Solr 8.10 respectively)
   - SolrCloud Backup option has been removed from
   SolrCloud.spec.dataStorage.backupRestoreOptions; please use
   SolrCloud.spec.backupRepositories instead
  - When upgrading, the Solr Operator will automatically migrate the
  information to the new location
   - SolrBackup Persistence has been removed
  - Please keep the data in the shared volume, or use a cloud-native
  backup repository instead (e.g. GCS, S3)
  - Any persistence options provided will be removed and ignored
   - Introducing recurring/scheduled backup support in SolrBackup resource
   - Ability to bootstrap a custom Solr security.xml from a Secret
   - Fix for managed SolrCloud upgrades across multiple SolrCloud resources
   (with a shared zookeeper connection string)
   - Easy enablement of Solr Modules and additional libraries for SolrCloud
   - Pod Lifecycle is now customizable for SolrCloud and
   SolrPrometheusExporter
   - SolrCloud can now be run across availability zones with support for
   PodSpreadTopologyConstraints
   - Augment the available Pod customization options for provided Zookeeper
   Clusters
   - The Solr Operator now runs with liveness and readiness probes by
   default
   - The Solr Operator now provides a metrics endpoint, which is enabled by
   default when using the Solr Operator Helm chart
   - Leader election is now enabled for the Solr Operator by default, and
   supports multiple namespace watching
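As a rough illustration of the new backup configuration, a SolrCloud
resource using spec.backupRepositories might look like the sketch below.
Field names are from memory and should be verified against the v0.5.0 CRD
documentation; bucket names, regions, and secret names are placeholders.

```yaml
apiVersion: solr.apache.org/v1beta1
kind: SolrCloud
metadata:
  name: example
spec:
  replicas: 3
  backupRepositories:
    - name: s3-backups          # referenced by name from SolrBackup resources
      s3:
        region: us-east-1
        bucket: my-solr-backups
    - name: gcs-backups
      gcs:
        bucket: my-solr-backups-gcs
        gcsCredentialSecret:
          name: gcs-creds
          key: service-account-key.json
```

Multiple repositories can coexist per SolrCloud, and the operator migrates
any legacy backupRestoreOptions settings into this list on upgrade.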


A summary of important changes is published in the documentation at:

  

For the most exhaustive list, see the change log on ArtifactHub or view the
git history in the solr-operator repo.

  <
https://artifacthub.io/packages/helm/apache-solr/solr-operator?modal=changelog
>

  


Solr limit in words search

2021-11-17 Thread Scott Q.
I am facing a weird issue, possibly caused by my config.

I have indexed a document that has a field called subject, defined as:

[fieldType XML stripped by the mail archive]

I have a document with subject field: cobrancas E-mail marketing em
dezembro, 2020 - referente ao uso de novembro

If I search for subject:"cobrancas e-mail" then it finds the document,
but if I search for subject:"cobrancas e-mail marketing" I have no
match. 

Why would this happen ?

Thank you!


Solr upgrade 3.6.1 TO 8.10.1 : ERROR Data at the root level is invalid. Line 1, position 1.

2021-11-17 Thread Heller, George A III CTR (USA)
We have an existing ASP.NET C# application that currently uses Solr 3.6.1 for
indexing and searching of documents. Using Solr 3.6.1, everything works fine.

I built a Solr 8.10.1 server, and when I try to upload documents to a Solr
collection in 8.10.1, I get a "Data is invalid at root level" error. Any help
would be appreciated.

 

CODE:

ISolrOperations solr =
    ServiceLocator.Current.GetInstance>();

var response = solr.Extract(
    new ExtractParameters(file, nextDocFile)
    {
        ExtractFormat = ExtractFormat.Text,
        ExtractOnly = false,
        AutoCommit = true
    });

 

ERROR LOG:

Exception Type: XmlException Error in: 
http://localhost:63408/document/new_document.aspx?uID=2081&dType=PHP&txtPath=%2fContribution+Folders%2fPHP%2f00+-++CLOSED+FILES%2f01-Intelligence%2f01-Personnel
 Error Message: Data at the root level is invalid. Line 1, position 1.
Error Source: System.Xml
StackTrace:    at System.Xml.XmlTextReaderImpl.Throw(Exception e)
   at System.Xml.XmlTextReaderImpl.Throw(String res, String arg)
   at System.Xml.XmlTextReaderImpl.ParseRootLevelWhitespace()
   at System.Xml.XmlTextReaderImpl.ParseDocumentContent()
   at System.Xml.XmlTextReaderImpl.Read()
   at System.Xml.Linq.XDocument.Load(XmlReader reader, LoadOptions options)
   at System.Xml.Linq.XDocument.Parse(String text, LoadOptions options)
   at System.Xml.Linq.XDocument.Parse(String text)
   at SolrNet.Impl.SolrBasicServer`1.SendAndParseExtract(ISolrCommand cmd)
   at SolrNet.Impl.SolrBasicServer`1.Extract(ExtractParameters parameters)
   at SolrNet.Impl.SolrServer`1.Extract(ExtractParameters parameters)
   at xyz_app.Document.new_document.Upload_File() in

 

 

 

 

Thanks,

George Heller

 

Email: george.a.heller2@mail.mil

 



 





Solr limit in words search - take 2

2021-11-17 Thread Scott
My apologies for the previous e-mail…should have never sent that as html

I am facing a weird issue, possibly caused by my config.

I have indexed a document that has a field called subject, defined as:

[fieldType XML stripped by the mail archive]

I have a document with subject field: cobrancas E-mail marketing em
dezembro, 2020 - referente ao uso de novembro

If I search for subject:"cobrancas e-mail" then it finds
the document, but if I search for subject:"cobrancas e-mail
marketing" I have no match.

Why would this happen ?

Thank you!




Re: Solr upgrade 3.6.1 TO 8.10.1 : ERROR Data at the root level is invalid. Line 1, position 1.

2021-11-17 Thread Scott Q.
I'm no expert, but I see you're expecting XML and, as far as I know, the
default response format in Solr 8 is JSON. Maybe check that?
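If the default response writer turns out to be the culprit, one possible
server-side workaround is to force the XML response writer for the extract
handler via an initParams block in solrconfig.xml. This is a hedged sketch
only; the handler path is illustrative and should be checked against the
actual solrconfig.xml, and the client library may also offer a way to
request wt=xml itself.

```xml
<!-- Hypothetical sketch: force the XML response writer for the extract
     handler so older XML-based clients keep working. Verify that
     /update/extract matches the handler path in your solrconfig.xml. -->
<initParams path="/update/extract">
  <lst name="defaults">
    <str name="wt">xml</str>
  </lst>
</initParams>
```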



On Wednesday, 17-11-2021 at 11:16 Heller, George A III CTR (USA)
wrote:






We have an existing ASP>NET C# application that currently uses Solr
3.6.1 for indexing and searching of documents. Using Solr
3.6.1,everything works fine.



I built a Solr 8.10.1 server and when I try to upload documents to a
Solr collection in 8.10.1, I get a "Data is invalid at root level"
error. Any help with would be appreciated. 




 



CODE:



ISolrOperations solr = ServiceLocator.Current.GetInstance();



 
  var response =



   
solr.Extract(



   
new ExtractParameters(file, nextDocFile)



   
{



 
  ExtractFormat
= ExtractFormat.Text,



   
ExtractOnly = false,



   
AutoCommit = true



  
 });




 



ERROR LOG:



Exception Type: XmlException Error in:
http://localhost:63408/document/new_document.aspx?uID=2081&dType=PHP&txtPath=%2fContribution+Folders%2fPHP%2f00+-++CLOSED+FILES%2f01-Intelligence%2f01-Personnel
Error Message: Data at the root level is invalid. Line 1, position
1.Error Source: System.XmlStackTrace:    at
System.Xml.XmlTextReaderImpl.Throw(Exception e)   at
System.Xml.XmlTextReaderImpl.Throw(String res, String arg)   at
System.Xml.XmlTextReaderImpl.ParseRootLevelWhitespace()   at
System.Xml.XmlTextReaderImpl.ParseDocumentContent()   at
System.Xml.XmlTextReaderImpl.Read [1]()   at
System.Xml.Linq.XDocument.Load(XmlReader reader, LoadOptions
options)   at System.Xml.Linq.XDocument.Parse(String text,
LoadOptions options)   at System.Xml.Linq.XDocument.Parse(String
text)   at
SolrNet.Impl.SolrBasicServer`1.SendAndParseExtract(ISolrCommand
cmd)   at SolrNet.Impl.SolrBasicServer`1.Extract(ExtractParameters
parameters)   at SolrNet.Impl.SolrServer`1.Extract(ExtractParameters
parameters)   at xyz_app.Document.new [2]_document.Upload_File() in
 



 



 



 



Thanks,



George Heller



 



Email: george.a.heller2@mail.mil



 



SecurityPlus Logo Certified CE



 





Links:
--
[1] http://System.Xml.XmlTextReaderImpl.Read
[2] http://app.Document.new


Re: Solr limit in words search - take 2

2021-11-17 Thread Michael Gibney
This is not the most thorough answer, but hopefully gets you headed in the
right direction:

Very strange things can happen when your index-time analysis chain
generates "graph" token-streams (as yours does). A couple of things you
could try:
1. experiment with setting `enableGraphQueries=false` on the fieldtype
2. upgrading to solr >=8.1 may address your issue partially, via
LUCENE-8730 -- here I go out on a limb in guessing that you're not
_already_ on 8.1+ :-)
3. increase the phrase slop param, to be more lenient in matching
"phrases". (as I say this I'm not sure it would actually help your case,
because you're dealing with explicit phrases, and iirc phrase slop may only
configure _implicit_ ("pf") phrase searches?)

The _best_ approach would be to configure your index-time analysis chain(s)
so that they don't have multi-term "expand" synonyms, and WDGF either only
splits ("generate*Parts", etc.) or only catenates ("catenate*",
"preserveOriginal"). One approach that can work is to index into two
fields, each with a dedicated index-time analysis type (split or catenate).
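The two-field approach described above might be sketched in schema XML
roughly as follows. This is an illustrative, hedged example, not the
poster's actual schema: field and type names are invented, the tokenizer
choice is arbitrary, and the WDGF parameters should be adapted to the real
analysis chain.

```xml
<!-- Hypothetical sketch: one type that only splits, one that only
     catenates, so neither emits a graph token stream at index time. -->
<fieldType name="text_split" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.WordDelimiterGraphFilterFactory"
            generateWordParts="1" generateNumberParts="1"
            catenateWords="0" catenateNumbers="0" preserveOriginal="0"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

<fieldType name="text_catenate" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.WordDelimiterGraphFilterFactory"
            generateWordParts="0" generateNumberParts="0"
            catenateWords="1" catenateNumbers="1" preserveOriginal="0"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

<field name="subject_split" type="text_split" indexed="true" stored="false"/>
<field name="subject_cat" type="text_catenate" indexed="true" stored="false"/>
<copyField source="subject" dest="subject_split"/>
<copyField source="subject" dest="subject_cat"/>
```

Queries then search both fields (e.g. via edismax qf), so "e-mail" can match
either the split tokens ("e", "mail") or the catenated token ("email")
without a graph at index time.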

Some relevant issues:
https://issues.apache.org/jira/browse/LUCENE-7398
https://issues.apache.org/jira/browse/LUCENE-4312

Michael

On Wed, Nov 17, 2021 at 11:18 AM Scott  wrote:

> My apologies for the previous e-mail…should have never sent that as html
>
> I am facing a weird issue, possibly caused by my config.
>
> I have indexed a document which has a field called subject, subject is
> defined as:
>
> 
>
>positionIncrementGap="100" multiValued="true">
> 
> 
>  generateWordParts="1" generateNumberParts="0" splitOnCaseChange="1"
> catenateWords="1" catenateNumbers="1" preserveOriginal="1"
> splitOnNumerics="0"/>
> 
> 
>  protected="protwords.txt"/>
> 
>  maxGramSize="45" />
> 
> 
> 
>  synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
>  generateWordParts="1" generateNumberParts="0" splitOnCaseChange="1"
> catenateWords="1" catenateNumbers="1" splitOnNumerics="0"/>
> 
> 
>  protected="protwords.txt"/>
> 
> 
>   
>
> I have a document with subject field: cobrancas E-mail marketing em
> dezembro, 2020 - referente ao uso de novembro
>
> If I search for subject:"cobrancas e-mail" then it
> finds
> the document, but if I search for subject:"cobrancas e-mail
> marketing" I have no match.
>
> Why would this happen ?
>
> Thank you!
>
>
>


RE: Solr limit in words search - take 2

2021-11-17 Thread Scott
Thanks Michael, let me look at those links.

I forgot to mention initially but I'm running version 8.6.2 Cloud/ZooKeeper


-Original Message-
From: Michael Gibney  
Sent: Wednesday, November 17, 2021 12:07 PM
To: users@solr.apache.org
Subject: Re: Solr limit in words search - take 2

[quoted message trimmed]



Re: NRT Searching (getting it working)

2021-11-17 Thread Andy Lester
> 
> I'm not sure how to understand the solr_gc.log file (but I'd like to)

There's a product called gceasy at gceasy.io. You can get a basic report on
your GC log by uploading it to them for analysis.

Andy

RE: Solr limit in words search - take 2

2021-11-17 Thread Scott
Could this be related ?

https://solr.apache.org/guide/6_6/filter-descriptions.html#FilterDescriptions-WordDelimiterGraphFilter

"If you use this filter during indexing, you must follow it with a Flatten 
Graph Filter to squash tokens on top of one another like the Word Delimiter 
Filter, because the indexer can’t directly consume a graph. To get fully 
correct positional queries when tokens are split, you should instead use this 
filter at query time."



-Original Message-
From: Michael Gibney  
Sent: Wednesday, November 17, 2021 12:07 PM
To: users@solr.apache.org
Subject: Re: Solr limit in words search - take 2

[quoted message trimmed]



Re: Solr limit in words search - take 2

2021-11-17 Thread Michael Gibney
Right, sorry, I forgot to mention the absence of FlattenGraphFilter. Tbh I'm
not 100% clear on which cases it helps with; but at the end of the day it has
no effect on the underlying issue: if your index-time analysis chain produces
"graph" token streams, the Lucene `[Default]IndexingChain` completely
disregards the PositionLengthAttribute, which is necessary to properly
reconstruct the indexed graph at query time.

It's possible FlattenGraphFilter might help your case -- in fact if you do
nothing else I'd certainly suggest that you use it. But I'm certain that
there are some classes of problems that are fundamentally related to
LUCENE-4312, and FlattenGraphFilter can't fix them. I'll be curious to know
whether the addition of FlattenGraphFilter helps in your case, though!

Michael

On Wed, Nov 17, 2021 at 12:57 PM Scott  wrote:

> Could this be related ?
>
>
> https://solr.apache.org/guide/6_6/filter-descriptions.html#FilterDescriptions-WordDelimiterGraphFilter
>
> "If you use this filter during indexing, you must follow it with a Flatten
> Graph Filter to squash tokens on top of one another like the Word Delimiter
> Filter, because the indexer can’t directly consume a graph. To get fully
> correct positional queries when tokens are split, you should instead use
> this filter at query time."
>
>
>
> -Original Message-
> From: Michael Gibney 
> Sent: Wednesday, November 17, 2021 12:07 PM
> To: users@solr.apache.org
> Subject: Re: Solr limit in words search - take 2
>
> [quoted message trimmed]


Re: NRT Searching (getting it working)

2021-11-17 Thread Derek C
That's an amazing online tool - thanks Andy

(I think looking at the generated charts/graphs that the Garbage collection
and memory usage is OK)

Derek

On Wed, Nov 17, 2021 at 5:53 PM Andy Lester  wrote:

> >
> > I'm not sure how to understand the solr_gc.log file (but I'd like to)
>
> There’s a product called gceasy at gceasy.io.  You
> can get a basic report on your GC log by uploading it to them for
> analysis.
>
> Andy



-- 
-- 
Derek Conniffe
Harvey Software Systems Ltd T/A HSSL
Telephone (IRL): 086 856 3823
Telephone (US): (650) 443 8285
Skype: dconnrt
Email: de...@hssl.ie


*Disclaimer:* This email and any files transmitted with it are confidential
and intended solely for the use of the individual or entity to whom they
are addressed. If you have received this email in error please delete it
(if you are not the intended recipient you are notified that disclosing,
copying, distributing or taking any action in reliance on the contents of
this information is strictly prohibited).
*Warning*: Although HSSL have taken reasonable precautions to ensure no
viruses are present in this email, HSSL cannot accept responsibility for
any loss or damage arising from the use of this email or attachments.
P For the Environment, please only print this email if necessary.


RE: Solr limit in words search - take 2

2021-11-17 Thread Scott
Ok, I'll add the FlattenGraphFilter (solr.FlattenGraphFilterFactory) in the 
indexer and see what happens.

It's so weird that it works, even in this state, when the docs say: this 
filter _must_ be included

I would have expected the indexer to throw errors if this filter is really 
required...

Thanks!
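[For reference, a minimal index-time chain with the flatten step in place might look like the following. This is only a sketch using Solr's standard filter factory names; the ngram, stemming, and synonym filters from the chain under discussion are omitted for brevity:]

```xml
<analyzer type="index">
  <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  <!-- WDGF emits a token graph at index time when it both splits and
       catenates / preserves originals... -->
  <filter class="solr.WordDelimiterGraphFilterFactory"
          generateWordParts="1" catenateWords="1" preserveOriginal="1"/>
  <!-- ...so it must be followed by FlattenGraphFilter, because the
       Lucene indexer cannot consume a graph directly -->
  <filter class="solr.FlattenGraphFilterFactory"/>
  <filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
```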

-Original Message-
From: Michael Gibney  
Sent: Wednesday, November 17, 2021 1:15 PM
To: users@solr.apache.org
Subject: Re: Solr limit in words search - take 2

Right, sorry I forgot to mention the absence of FlattenGraphFilter. Tbh I'm not 
100% clear on what cases it helps out with; but at the end of the day it has no 
effect on underlying issues having to do with the fact that if your index-time 
analysis chain produces "graph" tokenstreams, the Lucene 
`[Default]IndexingChain` completely disregards the PositionLengthAttribute, 
which is necessary to properly reconstruct the indexed graph at query time.

It's possible FlattenGraphFilter might help your case -- in fact if you do 
nothing else I'd certainly suggest that you use it. But I'm certain that there 
are some classes of problems that are fundamentally related to LUCENE-4312, and 
FlattenGraphFilter can't fix them. I'll be curious to know whether the addition 
of FlattenGraphFilter helps in your case, though!

Michael

On Wed, Nov 17, 2021 at 12:57 PM Scott  wrote:

> Could this be related ?
>
>
> https://solr.apache.org/guide/6_6/filter-descriptions.html#FilterDescriptions-WordDelimiterGraphFilter
>
> "If you use this filter during indexing, you must follow it with a 
> Flatten Graph Filter to squash tokens on top of one another like the 
> Word Delimiter Filter, because the indexer can’t directly consume a 
> graph. To get fully correct positional queries when tokens are split, 
> you should instead use this filter at query time."
>
>
>
> -Original Message-
> From: Michael Gibney 
> Sent: Wednesday, November 17, 2021 12:07 PM
> To: users@solr.apache.org
> Subject: Re: Solr limit in words search - take 2
>
> This is not the most thorough answer, but hopefully gets you headed in 
> the right direction:
>
> Very strange things can happen when your index-time analysis chain 
> generates "graph" token-streams (as yours does). A couple of things 
> you could try:
> 1. experiment with setting `enableGraphQueries=false` on the fieldtype
> 2. upgrading to solr >=8.1 may address your issue partially, via
> LUCENE-8730 -- here I go out on a limb in guessing that you're not 
> _already_ on 8.1+ :-)
> 3. increase the phrase slop param, to be more lenient in matching 
> "phrases". (as I say this I'm not sure it would actually help your 
> case, because you're dealing with explicit phrases, and iirc phrase 
> slop may only configure _implicit_ ("pf") phrase searches?)
>
> The _best_ approach would be to configure your index-time analysis
> chain(s) so that they don't have multi-term "expand" synonyms, and 
> WDGF either only splits ("generate*Parts", etc.) or only catenates 
> ("catenate*", "preserveOriginal"). One approach that can work is to 
> index into two fields, each with a dedicated index-time analysis type (split 
> or catenate).
>
> Some relevant issues:
> https://issues.apache.org/jira/browse/LUCENE-7398
> https://issues.apache.org/jira/browse/LUCENE-4312
>
> Michael
>
> On Wed, Nov 17, 2021 at 11:18 AM Scott  wrote:
>
> > My apologies for the previous e-mail…should have never sent that as 
> > html
> >
> > I am facing a weird issue, possibly caused by my config.
> >
> > I have indexed a document which has a field called subject, subject 
> > is defined as:
> >
> > 
> >
> >> positionIncrementGap="100" multiValued="true">
> > 
> > 
> >  > generateWordParts="1" generateNumberParts="0" splitOnCaseChange="1"
> > catenateWords="1" catenateNumbers="1" preserveOriginal="1"
> > splitOnNumerics="0"/>
> > 
> > 
> >  > protected="protwords.txt"/>
> > 
> >  minGramSize="2"
> > maxGramSize="45" />
> > 
> > 
> > 
> >  > synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
> >  > generateWordParts="1" generateNumberParts="0" splitOnCaseChange="1"
> > catenateWords="1" catenateNumbers="1" splitOnNumerics="0"/>
> > 
> > 
> >  > protected="protwords.txt"/>
> > 
> > 
> >   
> >
> > I have a document with subject field: cobrancas E-mail 
> > marketing em dezembro, 2020 - referente ao uso de novembro
> >
> > If I search for subject:"cobrancas e-mail" then 
> > it finds the document, but if I search for  > name="q">subject:"cobrancas e-mail marketing" I have no match.
> >
> > Why would this happen ?
> >
> > Thank you!
> >
> >
> >
>
>



Re: NRT Searching (getting it working)

2021-11-17 Thread Andy Lester


> On Nov 17, 2021, at 12:41 PM, Derek C  wrote:
> 
> That's an amazing online tool - thanks Andy

It was Shawn Heisey who pointed me to it.  There are many other JVM GC tools 
out there if you search a bit.

https://sematext.com/blog/java-gc-log-analysis-tools/

omitting term frequencies but keeping positions?

2021-11-17 Thread Edward Turner
Hi there,

Is there a way to omit only term frequencies but keep positions? I see it's
possible to omit frequencies and positions, or just positions for a field
in a schema.xml (
https://solr.apache.org/guide/8_5/field-type-definitions-and-properties.html#field-default-properties),
but we would like to omit term frequencies only. In a way, we assumed we
could achieve this with:

omitTermFreqAndPositions="true" // forget term frequencies, forget positions
omitPositions="false" // ... but actually, keep positions
// ==> forget term frequencies only

However, this looks just a bit weird ... and probably isn't how these
options are intended to be combined?

We also understand we can extend DefaultSimilarity (if it's still called
that?) and return a constant, like, 1.0f for all term frequencies. However,
from our recollection this requires creating a plugin and adding it to
Solr's classpath -- which is possible, but is additional work which we'd
rather not have to do if it is already an out of the box option.

Context: our use case is coming from our scientific curators saying:
- term frequencies are "overpowering" the contribution to the score of
length normalisation. E.g., if Document1 has a multivalued field with the
value, "Albumin", and someone searches this field for "Albumin", then they
*really* want that document back first. Instead, Document2 whose field
value is ["Albumin D box-binding protein", "Albumin D-element-binding
protein", "2S albumin"] is coming higher.
- they state that they're not actually interested in how many times a term
appears, it's either there or not.

Just to clarify from my above waffle, is it possible to omit term
frequencies only, but keep positions?

Many thanks, kind regards and have a nice day,

Edd

PS. I've seen this question asked a few times over the years, but wanted to
ask again in case I've missed a new option in Solr.


Edward Turner
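[A config-level note on the question above: Lucene's index options have no setting that keeps positions while dropping frequencies (positions imply freqs), but flattening tf at scoring time doesn't necessarily require a custom Similarity plugin. BM25's tf term is freq / (freq + k1 * (1 - b + b * dl/avgdl)), so setting k1=0 makes it exactly 1 for any freq. A sketch of a per-fieldType similarity doing this, assuming the global SchemaSimilarityFactory default so per-field similarities apply; note the caveat that k1=0 also disables BM25's length normalisation, since dl/avgdl only enters through the k1 term:]

```xml
<fieldType name="text_flat_tf" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <!-- k1=0 collapses BM25's tf term to 1.0: a term either matches or it doesn't -->
  <similarity class="solr.BM25SimilarityFactory">
    <float name="k1">0.0</float>
  </similarity>
</fieldType>
```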


Re: Solr limit in words search

2021-11-17 Thread Shawn Heisey

On 11/17/21 9:00 AM, Scott Q. wrote:

I am facing a weird issue, possibly caused by my config.

I have indexed a document which has a field called subject, subject is
defined as:


 -- the definition you included is blank in the email that I got.  
I do not know why.  If it was an email attachment, the mailing list eats 
almost all attachments that get sent.



I have a document with subject field: cobrancas E-mail marketing em
dezembro, 2020 - referente ao uso de novembro

If I search for subject:"cobrancas e-mail" then it finds the document,
but if I search for subject:"cobrancas e-mail marketing" I have no
match.

Why would this happen ?


There could be a lot of reasons.  My best guess at the moment is that 
you have stemming configured on the analysis chain and the phrase search 
(quotes) is making that NOT happen on the query analysis.  The analysis 
tab in the admin UI unfortunately cannot show you what happens with a 
phrase query.  Ordinarily I would suggest using that to see what 
happens, but in this case we can't do that.


Can you share your schema file? It is usually named managed-schema (with 
no extension) or schema.xml, depending on solrconfig.xml.


Also, if you add a "debugQuery=true" parameter to the query request, you 
can see how Solr ultimately analyzes and parses the query.  I would like 
to see the full response with debug enabled, both on the search that 
succeeds and the one that fails.  And if you can do another search for 
subject:(cobrancas e-mail marketing), replacing the quotes with 
parentheses, I would like to see the debug output from that as well.


What version of Solr, and was it installed from the binary release download?

Thanks,
Shawn




Re: NRT Searching (getting it working)

2021-11-17 Thread Shawn Heisey

On 11/17/21 8:24 AM, Derek C wrote:

Now that you mention it. Just after I sent the email I did start
looking at top and I was seeing 100%, 200%, 300% CPU usage (and the VM only
has 4 cores so I was looking at maxed out cores).  I've grabbed screenshots
now and I'm not seeing those high numbers but I've also grabbed a
screenshot of the VM CPU usage (it's a VM in proxmox).


VMs make things a little muddy.  There could be something going on at 
the VM host level that affects the VM, especially if the total amount of 
memory given to all the VMS is larger than the amount of memory actually 
installed on the VM host.  Such oversubscription is extremely common in 
virtualized environments, and if that is happening, it will absolutely 
destroy performance across the board.



I'm not sure how to understand the solr_gc.log file (but I'd like to)


The best *easy* resource is what you were already told about -- the 
gceasy.io website.  There are a number of other tools available that can 
analyze the logs locally with more detail, but the important bits are 
shown by gceasy.  It's what I use.


There are no significant GC pauses.  The amount of heap space used shown 
by the GC logs is extremely small, far smaller than the 12GB you had 
mentioned before.  If that level of heap usage is typical, you could 
most likely drastically reduce the max heap size, leaving more memory 
available for the OS to use for disk caching. Effective disk caching is 
the key to making Solr perform well.


The solr.log file is missing.  That contains a lot of the info I was 
hoping to see.


The screenshot from top looks like your Solr instance has somewhere in 
the ballpark of 20GB of index data, but there is only 4GB of data in the 
disk cache.  That would be about right if you are using a 12GB heap on a 
system with 16GB of memory.  Maybe the VM and/or the physical host 
doesn't have enough memory?


Thanks,
Shawn




Execute bulk partial update on solr

2021-11-17 Thread Karan Jain
Hi *Everyone*,

We want to update a field of more than one document without the ID being
passed. We may need to update 1.5 M documents in one go. It seems to us
that executing an atomic update for a field in a Solr cluster (8.4.1) is a
good way to do that. As specified in
https://solr.apache.org/guide/8_2/updating-parts-of-documents.html#field-storage,
we have checked that all fields in the managed-schema file have stored=true.
We have the questions below in regard to that:

1) Can we try an atomic update for updating a field of more than one
document without the ID being passed? If not, please suggest another way
to do this.
2) Do we need to stop the traffic (read and write) while updating a field
with atomic update?
3) Is the atomic update possible if we want to update 1.5 M documents in
one go? Can we execute the add, set, etc. operations specified in
https://solr.apache.org/guide/8_2/updating-parts-of-documents.html#atomic-updates
for the same key-value pair for about 1.5 M documents?

Please let us know your thoughts on the above questions.

Best,
Karan


Re: Solr upgrade 3.6.1 TO 8.10.1 : ERROR Data at the root level is invalid. Line 1, position 1.

2021-11-17 Thread Shawn Heisey

On 11/17/21 9:16 AM, Heller, George A III CTR (USA) wrote:


We have an existing ASP.NET C# application that currently uses Solr 
3.6.1 for indexing and searching of documents. Using Solr 
3.6.1, everything works fine.


I built a Solr 8.10.1 server and when I try to upload documents to a 
Solr collection in 8.10.1, I get a "Data is invalid at root level" 
error. Any help with would be appreciated.




Scott's answer is very likely the problem.  Solr switched to JSON as the 
default response writer in version 7.0.  Before that it was XML.


You can either add a "wt" parameter set to "xml" in the SolrNET code, or 
add it to the defaults section of the handler definition in 
solrconfig.xml.  I have no idea how to get SolrNET to do it ... that 
client is third party, not developed by the Solr project.


The solrconfig.xml file developed by the dovecot team for their Solr 
integration has this as a top-level element (subordinate only to the 
"<config>" element) to switch the default back to XML:


  <queryResponseWriter name="xml" default="true" class="solr.XMLResponseWriter"/>

Thanks,
Shawn




Re: Execute bulk partial update on solr

2021-11-17 Thread Shawn Heisey

On 11/17/21 5:45 PM, Karan Jain wrote:

1) Can we try an atomic update for updating a field of more than one
document without the ID being passed? If not, please suggest another way
to do this.


I have never heard of any such capability.  As far as I am aware, you 
would have to send indexing requests that include an update for each 
ID.  You can (and should) batch them.  It would not be a good idea to 
send 1.5 million individual requests.



2) Do we need to stop the traffic (read and write) while updating a field
with atomic update?


For standard atomic updates, no.  For the in-place update capability, 
which has VERY specific requirements that typically most people cannot 
meet, I am not sure.



3) Is the atomic update possible if we want to update 1.5 M documents in
one go? Can we execute the add, set, etc. operations specified in
https://solr.apache.org/guide/8_2/updating-parts-of-documents.html#atomic-updates
for the same key-value pair for about 1.5 M documents?



Solr is a search engine, not a database.  An RDBMS (one example being 
mysql) can easily do this, but Solr can't.  Solr thinks in terms of 
documents, not tables comprised of columns.


Thanks,
Shawn
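[The batching Shawn recommends can be sketched as follows. This is a hypothetical helper, not part of any Solr client library; the field name, value, and batch size are illustrative, and each emitted payload would be POSTed to the collection's /update endpoint with Content-Type: application/json. The IDs themselves still have to be known up front, e.g. gathered from a query.]

```python
import json

def atomic_update_batches(doc_ids, field, value, batch_size=1000):
    """Yield JSON payloads for Solr's /update handler, one batch at a time.
    Each document is an atomic update: {"id": ..., field: {"set": value}}."""
    batch = []
    for doc_id in doc_ids:
        batch.append({"id": doc_id, field: {"set": value}})
        if len(batch) == batch_size:
            yield json.dumps(batch)
            batch = []
    if batch:  # final partial batch
        yield json.dumps(batch)

# 2500 ids -> three requests: 1000 + 1000 + 500 documents
payloads = list(atomic_update_batches((f"doc-{i}" for i in range(2500)),
                                      "status", "archived"))
print(len(payloads))  # 3
```

For 1.5 M documents with a batch size of 1000 this means 1500 HTTP requests rather than 1.5 million, which is the difference Shawn is pointing at.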




Re: [ANNOUNCE] Apache Solr 8.11.0 released

2021-11-17 Thread David Smiley
I'll start the Docker image release process now; it should be out by the
weekend hopefully.

~ David Smiley
Apache Lucene/Solr Search Developer
http://www.linkedin.com/in/davidwsmiley


On Wed, Nov 17, 2021 at 3:54 AM Adrien Grand  wrote:

> The Solr PMC is pleased to announce the release of Apache Solr 8.11
>
> Solr is the popular, blazing fast, open source NoSQL search platform from
> the Apache Solr project. Its major features include powerful full-text
> search, hit highlighting, faceted search and analytics, rich document
> parsing, geospatial search, extensive REST APIs as well as parallel SQL.
> Solr is enterprise grade, secure and highly scalable, providing fault
> tolerant distributed search and indexing, and powers the search and
> navigation features of many of the world's largest internet sites.
>
> The release is available for immediate download at:
>   https://solr.apache.org/downloads.html
>
> Please read CHANGES.txt for a detailed list of changes:
>   https://solr.apache.org/docs/8_11_0/changes/Changes.html
>
> Solr 8.11.0 Release Highlights
>  * Security
>- MultiAuthPlugin (for authentication) and
> MultiAuthRuleBasedAuthorizationPlugin (for authorization) classes to
> support multiple authentication schemes, such as Bearer and Basic. This
> allows the Admin UI to use OIDC (JWTAuthPlugin) to authenticate users while
> still supporting Basic authentication for command-line tools and the
> Prometheus exporter.
>
> A summary of important changes is published in the Solr Reference Guide at
> https://solr.apache.org/guide/8_11/solr-upgrade-notes.html. For the most
> exhaustive list, see the full release notes at
> https://solr.apache.org/docs/8_11_0/changes/Changes.html or by viewing
> the CHANGES.txt file accompanying the distribution.  Solr's release notes
> usually don't include Lucene layer changes.  Lucene's release notes are at
> https://lucene.apache.org/core/8_11_0/changes/Changes.html
>
> --
> Adrien
>


RE: Solr limit in words search

2021-11-17 Thread Scott
Thanks Shawn, not sure if you saw, but I resent without html formatting and it 
came through fine. I'll put it here again along with the preliminary conclusion 
that I was missing the Flatten filter in my indexer. Here are the schema 
details + output you requested:



  


















  

Original query between quotes, no matches:

  subject:"cobrancas e\-mail marketing"
  subject:"cobrancas e\-mail marketing"
  SpanNearQuery(spanNear([subject:cobranca, 
spanOr([subject:email, spanNear([subject:e, subject:mail], 0, true)]), 
subject:marketing], 0, true))
  spanNear([subject:cobranca, 
spanOr([subject:email, spanNear([subject:e, subject:mail], 0, true)]), 
subject:marketing], 0, true)
  LuceneQParser


Original query without 'marketing' between quotes, matches:

  subject:"cobrancas e\-mail"
  subject:"cobrancas e\-mail"
  SpanNearQuery(spanNear([subject:cobranca, 
spanOr([subject:email, spanNear([subject:e, subject:mail], 0, true)])], 0, 
true))
  spanNear([subject:cobranca, 
spanOr([subject:email, spanNear([subject:e, subject:mail], 0, true)])], 0, 
true)
  LuceneQParser
  
  


  true
  27.416113
  weight(spanNear([subject:cobranca, 
spanOr([subject:email, spanNear([subject:e, subject:mail], 0, true)])], 0, 
true) in 748821) [SchemaSimilarity], result of:
  

  true
  27.416113
  score(freq=1.0), computed as boost * idf * tf 
from:
  

  true
  2.2
  boost


  true
  19.544073
  idf, sum of:
  

  true
  9.65364
  idf, computed as log(1 + (N - n + 
0.5) / (n + 0.5)) from:
  

  true
  1906
  n, number of documents containing 
term


  true
  29700198
  N, total number of documents with 
field

  


  true
  4.8891644
  idf, computed as log(1 + (N - n + 
0.5) / (n + 0.5)) from:
  

  true
  223574
  n, number of documents containing 
term


  true
  29700198
  N, total number of documents with 
field

  


  true
  5.0012693
  idf, computed as log(1 + (N - n + 
0.5) / (n + 0.5)) from:
  

  true
  199864
  n, number of documents containing 
term


  true
  29700198
  N, total number of documents with 
field

  

  


  true
  0.63762903
  tf, computed as freq / (freq + k1 * (1 - 
b + b * dl / avgdl)) from:
  

  true
  1.0
  phraseFreq=1.0


  true
  1.2
  k1, term saturation parameter


  true
  0.75
  b, length normalization 
parameter


  true
  12.0
  dl, length of field


  true
  40.25195
  avgdl, average length of field

  

  

  


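[As a sanity check, the numbers in the explain output above follow directly from the BM25 formulas it names. The values below are copied from that output: freq=1, k1=1.2, b=0.75, dl=12, avgdl=40.25195, boost=2.2, N=29700198, n=1906 for the first clause.]

```python
import math

# Values copied from the explain output.
freq, k1, b, dl, avgdl = 1.0, 1.2, 0.75, 12.0, 40.25195
boost = 2.2

# tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl))
tf = freq / (freq + k1 * (1 - b + b * dl / avgdl))

# idf for the first clause, computed as log(1 + (N - n + 0.5) / (n + 0.5))
N, n = 29700198, 1906
idf_cobranca = math.log(1 + (N - n + 0.5) / (n + 0.5))

# Total idf is the sum over the clauses (the other two values as printed).
idf_total = idf_cobranca + 4.8891644 + 5.0012693
score = boost * idf_total * tf  # score(freq=1.0) = boost * idf * tf

print(f"{tf:.6f}")            # ~0.637629
print(f"{idf_cobranca:.5f}")  # ~9.65364
print(f"{score:.4f}")         # ~27.4161
```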

Original query, between (), matches, but it also matches other unwanted 
documents such as 'marketing plans', etc.

subject:(cobrancas e\-mail marketing)
  subject:(cobrancas e\-mail marketing)
  subject:cobranca (subject:email (+subject:e 
+subject:mail)) subject:marketing
  subject:cobranca (subject:email (+subject:e 
+subject:mail)) subject:marketing
  LuceneQParser
  
   

  true
  29.841742
  sum of:
  

  true
  13.541982
  weight(subject:cobranca in 748821) 
[SchemaSimilarity], result of:
  

  true
  13.541982
  score(freq=1.0), computed as boost * idf 
* tf from:
  

Re: Solr limit in words search

2021-11-17 Thread Shawn Heisey

On 11/17/2021 9:36 PM, Scott wrote:

Thanks Shawn, not sure if you saw, but I resent without html formatting and it 
came through fine. I'll put it here again along with the preliminary conclusion 
that I was missing the Flatten filter in my indexer. Here are the schema 
details + output you requested:


It looks like the stemmer isn't doing what I suspected, so that theory 
is out.  I added your fieldType to a Solr install that I have, and did 
some fiddling with the analysis tab.  I'm running version 8.11.0, the 
one that just got released.


It is possible that missing the flatten filter could cause problems. 
You'll need to reindex after adding it.


And it is always possible that you're running into a bug that might be 
fixed in a later release than you're running.


Thanks,
Shawn


Re: Incremental backup for Standalone Solr

2021-11-17 Thread Abeleshev Artyom
Thanks, Jason, for the detailed answer. Now I got the point and it makes
sense.

Best regards,
Artem Abeleshev


On Mon, Nov 15, 2021 at 10:05 PM Jason Gerlowski 
wrote:

> Hey Artem,
>
> Incremental backups were written primarily with SolrCloud in mind.
> Many of the APIs (backup listing, backup deletion, etc.) work only in
> SolrCloud, and most of our automated tests around backups focus on
> SolrCloud setups.
>
> That said, incremental backup in SolrCloud relies on doing incremental
> backup at the core-level, and there are internal APIs that expose this
> that you can hit in standalone mode if you'd really like to.  These
> APIs are considered "internal": they're subject to change, intended
> for Solr's own use under the hood, and a bit ugly.
> But it is possible to use them in standalone mode for doing
> backup/restore.
>
> In general the backupcore API call looks like:
>
> /solr/admin/cores?
> action=BACKUPCORE&
> core=coreName&
> shardBackupId=md_coreName_1&
> prevShardBackupId=md_coreName_0&
> repository=someBackupRepoName&
> location=/some/location
>
> 'action' and 'core' should be pretty self-explanatory.
>
> The 'repository' and 'location' parameters are explained in the
> collection-level docs here:
>
> https://solr.apache.org/guide/8_10/collection-management.html#backup-parameters
> .
> There's also more info here on 'repository' in particular:
>
> https://solr.apache.org/guide/8_10/making-and-restoring-backups.html#backuprestore-storage-repositories
>
> (Note that the 'location' should already exist, and should contain two
> subdirectories: 'index', and 'shard_backup_metadata'.  These are
> details that are handled for users in SolrCloud, but that need to be
> handled manually when using the core-level APIs in standalone mode.)
>
> The least intuitive parameters are 'shardBackupId' and
> 'prevShardBackupId'.  In a standalone/core-level frame of reference,
> these effectively name the current backup ('shardBackupId') and
> provide a pointer to the previous incremental backup for the core
> ('prevShardBackupId') if there is one.  The values for these params
> are pretty inflexible: I'd advocate always using 'md_coreName_0',
> 'md_coreName_1', 'md_coreName_2', etc.
>
> So on successive backups of the same core, you might use params like:
> -
> "action=BACKUPCORE&core=coreName&shardBackupId=md_coreName_0&repository=someRepo&location=/some/location"
> -
> "action=BACKUPCORE&core=coreName&shardBackupId=md_coreName_1&prevShardBackupId=md_coreName_0&repository=someRepo&location=/some/location"
> - and so on.
>
> TL;DR - there are "internal" APIs for incremental backups in
> standalone mode, but (at this point) they're an implementation detail
> of the SolrCloud support that's subject to change. Standalone users
> should exercise their best judgement in deciding whether to use these,
> or some of the other standalone backup options
> (/replication?command=backup, /solr/admin/cores?action=CREATESNAPSHOT,
> etc.)
>
> Best,
>
> Jason
>
> On Sun, Nov 14, 2021 at 9:53 PM Abeleshev Artyom 
> wrote:
> >
> > On the previous version of Solr (started from 8.9) a new incremental
> backup
> > support has been added. It was based on the following proposal
> >
> https://cwiki.apache.org/confluence/display/SOLR/SIP-12%3A+Incremental+Backup+and+Restore
> .
> > JIRA issue https://issues.apache.org/jira/browse/SOLR-15086 for the SIP
> > that was used for managing subtasks is closed with the comment that all
> the
> > included proposals are being implemented. I see some updates on
> > documentation for using incremental backup for SolrCloud, but what about
> > Standalone Solr? Is there a reason why it is not announced that
> incremental
> > support is also added for Standalone Solr? I don't see any mentions about
> > incremental backup on Standalone Solr documentation. I've checked the
> > sources and it seems that everything is in place and support is included
> > (subissue https://issues.apache.org/jira/browse/SOLR-13608 was closed at
> > version 8.9). Is it still not reliable and incremental backups have some
> > issues? Or maybe just forgot to update documentation files?
> >
> > Best regards,
> > Artem Abeleshev
>