solr query sanitizer?

2024-05-29 Thread Dmitri Maziuk

Hi all,

our website has a search box that essentially passes its contents to 
Solr without any massaging. This works fine 99% of the time, the other 
1% is when a misbehaving bot hits it and tries stuffing all sorts of 
crap in there.


Then bad things happen: Java's overly verbose exception stack traces 
fill up the disk faster than the logs are rotated, CPU load spikes, etc.


So, question: does anyone know of a validator/sanitizer we can use to clean 
up the terms before passing them on to Solr? -- My google-fu fails to 
find one.


TIA
Dima


Re: Seeking Advice: Setting up SSL in Solr 9.5 on Centos 7

2024-05-29 Thread Lee Daniel

Yeah, you're right; I did some more reading last night.

I tried a few different domains last night and even disabled the SNI 
Check but no luck.



I believe the issue is the two-step process the documentation gives 
for generating a self-signed certificate.
There is more to the process than that; they may have assumed we would 
already know it, but I don't.


Thanks.

Lee


On 2024-05-28 20:56, Dmitri Maziuk wrote:

> On 5/28/24 19:35, Lee Daniel wrote:
>
>> Interesting.
>>
>> Based on my lack of understanding, using z.com could mean two things:
>>
>> 1. Would I have to edit the certificate for each extra site/node we add?
>> 2. Or have another instance of Solr for each site?
>
> So this is a whole different rant, but the practical result of the 
> "secure by default" idiocy is that everyone gets a cert with 
> CN=foo.bar and SAN=*.foo.bar and then uses it on every host they have. 
> (And SANs can be in different domain too.)
>
> Assuming you're not actually in a TLD and have a dot in your "foo.bar" 
> (for SNI), you could try that. But like I said, I don't know what 
> tentacles may lurk in the Java implementation. Jetty may or may not 
> like it.
>
> Dima



Re: solr query sanitizer?

2024-05-29 Thread Thomas Corthals
Solarium (a PHP client for Solr) has a helper method to escape search terms
that uses a regex to escape special characters.

https://github.com/solariumphp/solarium/blob/c2744ff706a2f0be148a45d702700fc346429679/src/Core/Query/Helper.php#L82
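
For anyone not on PHP, the same idea can be sketched in a few lines of Python: backslash-escape every character the standard query parser treats as syntax. This is not Solarium's code. The character list is taken from the Lucene query syntax documentation, and the name escape_term is made up for illustration.

```python
import re

# Characters and two-character operators that the Lucene/Solr standard
# query parser treats as syntax. The backslash comes first so that an
# existing backslash in the input is itself escaped.
_SPECIAL = re.compile(r'(\\|\+|-|&&|\|\||!|\(|\)|\{|\}|\[|\]|\^|"|~|\*|\?|:|/)')


def escape_term(term: str) -> str:
    """Prefix each special character (or operator) with a backslash."""
    return _SPECIAL.sub(r'\\\1', term)


if __name__ == "__main__":
    print(escape_term('title:(foo*)'))
```

Escaping keeps the user's characters in the query as literals. It does not limit query length or the number of terms, so it only addresses the syntax-error side of the problem.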

Thomas



Re: solr query sanitizer?

2024-05-29 Thread Mikhail Khludnev
Hello Dima,
You didn't mention the query parser. Perhaps the simple query parser
https://solr.apache.org/guide/solr/latest/query-guide/other-parsers.html#simple-query-parser
might be suitable. Regarding stack traces in the logs: I believe stack traces
can be disabled via the log config; please check
https://logging.apache.org/log4j/2.x/manual/layouts.html#pattern-layout (if
Solr still uses log4j2).
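
As a rough illustration of the first suggestion, the parser can be selected per request with the defType parameter. The sketch below only builds the query string. The host, collection name and field names are placeholders, and whether the simple parser's behaviour suits a given site is something to test rather than assume.

```python
from urllib.parse import urlencode


def simple_parser_params(user_input: str, fields: str) -> str:
    """Return a URL query string that selects Solr's simple query parser."""
    params = {
        "defType": "simple",  # use the simple query parser for this request
        "q": user_input,      # raw text from the search box
        "qf": fields,         # space-separated fields to search (placeholders)
    }
    return urlencode(params)


if __name__ == "__main__":
    # Hypothetical host and collection name.
    base = "http://localhost:8983/solr/mycollection/select?"
    print(base + simple_parser_params("foo AND (bar", "title text"))
```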



-- 
Sincerely yours
Mikhail Khludnev


Re: solr query sanitizer?

2024-05-29 Thread Walter Underwood
I’ve done three kinds of sanity checks/fixes to avoid performance problems.

1. Prevent deep paging. Have to do this every time. When a request comes in for 
a page past 50, it gets rewritten to the 50th page.

2. Limit the size of queries. With homework help, we had people pasting in 800 
word queries. Those get trimmed to 40 words. The results for 40 words were 
nearly the same as those for 80 words in a test of a few thousand real user 
queries. Google only does 32.

3. Remove all syntax characters (or replace them with spaces). This gets 
tricky, because things like “-” are OK inside a word. A more conservative 
approach is to remove “*” and “?”, so you prevent script kiddie queries like 
“a* b* c* d* e* f* …”
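
The three checks above could be sketched in Python roughly as follows. This is an editorial illustration, not the code Walter ran. The limits of 50 pages and 40 words come from the message, while the page size of 10 and all function names are assumptions.

```python
import re

MAX_PAGE = 50    # pages past this are rewritten to the 50th page
MAX_WORDS = 40   # longer queries are trimmed to this many words
ROWS = 10        # assumed page size


def clamp_start(page: int, rows: int = ROWS) -> int:
    """Return the Solr 'start' offset for a 1-based page, capped at MAX_PAGE."""
    page = max(1, min(page, MAX_PAGE))
    return (page - 1) * rows


def strip_wildcards(query: str) -> str:
    """Conservative variant of check 3: replace '*' and '?' with spaces."""
    return re.sub(r"[*?]", " ", query)


def trim_words(query: str, max_words: int = MAX_WORDS) -> str:
    """Keep the first max_words whitespace-separated words."""
    return " ".join(query.split()[:max_words])


def sanitize(query: str) -> str:
    """Strip wildcards first, then trim, so stray spaces are collapsed too."""
    return trim_words(strip_wildcards(query))


if __name__ == "__main__":
    print(sanitize("a* b* c* d* e* f*"), clamp_start(500))
```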

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)




Re: solr query sanitizer?

2024-05-29 Thread Dmitri Maziuk

On 5/29/24 11:43, Walter Underwood wrote:

> I’ve done three kinds of sanity checks/fixes to avoid performance problems.
>
> 1. Prevent deep paging. Have to do this every time. When a request comes in for 
> a page past 50, it gets rewritten to the 50th page.
>
> 2. Limit the size of queries. With homework help, we had people pasting in 800 
> word queries. Those get trimmed to 40 words. The results for 40 words were 
> nearly the same as those for 80 words in a test of a few thousand real user 
> queries. Google only does 32.
>
> 3. Removing all syntax characters (or replacing them with spaces). This gets 
> tricky, because things like “-” are OK inside a word. A more conservative 
> approach is to remove “*” and “?”, so you prevent script kiddie queries like 
> “a* b* c* d* e* f* …”


Thanks, everyone.

For #3 I think I'll steal the regexes from Solarium, as Thomas suggested. 
#1 & 2 aren't our problem ATM but are worth adding while I'm at it.


I have doubts about reconfiguring the logging as per Misha's suggestion: 
it'll save some disk space but exceptions themselves will still be there 
with all their overhead... and disk is the cheapest part of it all.


And yeah, we are using the standard parser. It may be worth switching to 
e.g. edismax, but that comes with lots of regression testing (and 
finding all the places to test first), making it a much bigger project.


Thanks again,
Dima



Re: solr query sanitizer?

2024-05-29 Thread Walter Underwood
Honestly, there is a missing feature here. Solr should have a free text query 
parser. Run the query through the standard tokenizer, ignore all the syntax, and 
make a bunch of word/phrase queries.
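
Until such a parser exists, the idea can be approximated on the client side. The Python sketch below is only an approximation under stated assumptions: a \w+ regex stands in for Solr's standard tokenizer, double-quoted runs become phrase queries, and everything else is dropped. The name free_text_query and the 40-clause cap are made up for illustration.

```python
import re

# Upper-case boolean operators of the standard parser; lower-casing them
# turns them back into ordinary search terms.
_OPERATORS = {"AND", "OR", "NOT"}

# Either a double-quoted run (candidate phrase) or a bare word.
_TOKEN = re.compile(r'"([^"]*)"|(\w+)')


def _plain(word: str) -> str:
    return word.lower() if word in _OPERATORS else word


def free_text_query(text: str, max_clauses: int = 40) -> str:
    """Rebuild user input as plain word and phrase clauses, ignoring syntax."""
    clauses = []
    for phrase, word in _TOKEN.findall(text):
        words = re.findall(r"\w+", phrase) if phrase else ([word] if word else [])
        if len(words) > 1:
            clauses.append('"' + " ".join(words) + '"')  # phrase query
        elif words:
            clauses.append(_plain(words[0]))             # single word query
    return " ".join(clauses[:max_clauses])


if __name__ == "__main__":
    print(free_text_query('foo:"bar baz" AND (qux*)'))
```

Because only word characters and balanced quotes survive, the output should contain nothing the standard parser would read as an operator, at the cost of throwing away any syntax a user typed on purpose.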

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)




Re: Performance Suggestion for Dense Vectors

2024-05-29 Thread David Smiley
There *is* a Solr blog site that just launched:
 https://solr.apache.org/blog.html

On Thu, Mar 28, 2024 at 3:49 PM rajani m  wrote:
>
> @Alessandro,
> Is there a solr blog site where we can submit work/articles or are you
> suggesting to post on my own site and share a link here? I prefer the
> former if there is one because there were times when I had my own,
> it hardly had any views and on top of that google blogging made me migrate
> from blogs to sites and sites got deprecated. Is there or can we have a
> solr specific wiki/blog site  where solr users can submit common features
> configs/modules configs/examples/performance metrics and so on and maybe
> have a voting/likes to confirm it works. We will have one common place to
> submit and look for.
>
>
>
> On Thu, Mar 28, 2024 at 3:33 PM rajani m  wrote:
>
> > Run the same knn queries at a slow throughput  for 30-60 minutes, this
> > should warm up disk caches with hnsw index files, and then you should see a
> > significant drop in the query time. Also make use of "fq" and reduce the
> > document space as much as you can.
> >
> > On Thu, Mar 28, 2024 at 12:50 PM Iram Tariq
> >  wrote:
> >
> >> Hi  Alessandro,
> >>
> >> Thank you for the feedback. Kindly see my comments below,
> >>
> >> *Ale*:
> >> https://www.elastic.co/blog/accelerating-vector-search-simd-instructions,
> >> I
> >> suggest experimenting with SIMD vector improvements (unless you are
> >> already doing it)
> >>
> >> * We will try this soon. *
> >>
> >> *Ale*: What about the machine memory?
> >>
> >> Following is the system specification:  Linux ( CPU:64, RAM:488 GB,
> >> OS:Ubuntu 20.04.6 )
> >>
> >> *Ale*: you can fine-tune the hyper-parameter to compromise a bit on recall
> >> in favour of performance  (hnswBeamWidth, hnswMaxConnections)
> >>
> >> I am trying this as a first step. But I am sure it will impact recall.
> >>
> >> Regards,
> >>
> >>
> >> Iram Tariq | Software Architect
> >>
> >> NorthBay
> >>
> >> Direct:  +1 (902) 329-7329
> >>
> >> iram.ta...@northbaysolutions.net
> >>
> >> www.northbaysolutions.com
> >>
> >>
> >>
> >>
> >> On Thu, Mar 28, 2024 at 5:42 AM Alessandro Benedetti <
> >> a.benede...@sease.io>
> >> wrote:
> >>
> >> > That's interesting.
> >> > I think it's vital to get back some performance tests from the
> >> community.
> >> > Since my contribution to support Vector-search in Apache Solr was
> >> merged,
> >> > we got little or no feedback to understand its performance, in
> >> real-world
> >> > use cases.
> >> > Blogs, open benchmarks or even just this sort of mail message are
> >> welcome.
> >> > Let me reply in line:
> >> > --
> >> > *Alessandro Benedetti*
> >> > Director @ Sease Ltd.
> >> > *Apache Lucene/Solr Committer*
> >> > *Apache Solr PMC Member*
> >> >
> >> > e-mail: a.benede...@sease.io
> >> >
> >> >
> >> > *Sease* - Information Retrieval Applied
> >> > Consulting | Training | Open Source
> >> >
> >> > Website: Sease.io 
> >> > 
> >> >
> >> >
> >> > On Wed, 27 Mar 2024 at 21:06, Kent Fitch  wrote:
> >> >
> >> > > Hi Iram,
> >> > >
> >> > > Is the machine doing lots of IO? If the hnsw graphs are not entirely
> >> in
> >> > > memory, performance will be poor. What JVM? You may get some benefit
> >> from
> >> > > simd support in java 21. Can you use the latest quantisation changes
> >> in
> >> > > Lucene to reduce memory footprint of the hnsw graphs? That's a large
> >> > topk,
> >> > > but I guess you need it?
> >> > >
> >> > > Best regards
> >> > >
> >> > > Kent Fitch
> >> > >
> >> > > On Thu, 28 Mar 2024, 5:12 am Iram Tariq,
> >> > >  wrote:
> >> > >
> >> > > > Hi All,
> >> > > >
> >> > > > I am using Dense vectors in SOLR and facing slowness in it. Each
> >> search
> >> > > is
> >> > > > taking 10-25 seconds. I want to reduce the time to 5 seconds (or
> >> less
> >> > > > ideally).
> >> > > >
> >> > > > Following configurations are being used.
> >> > > >
> >> > > >
> >> > > >1. *SOLR Version:* 9.3.0
> >> > > >2. *Lucene Version:* 9.7.0
> >> > >
> >> > *Ale*:
> >> >
> >> https://www.elastic.co/blog/accelerating-vector-search-simd-instructions,
> >> > I
> >> > suggest experimenting with SIMD vector improvements (unless you are
> >> > already doing it)
> >> >
> >> > > >3. *Vector Dimensions*: 384
> >> > > >4. *Total Shards:* 5
> >> > > >5. *Number of Vectors (Per shard*): 43209158
> >> > > >6. *JVM for each Instance:* 35GB
> >> > >
> >> > *Ale*: What about the machine memory?
> >> >
> >> > > >7. *TopK: *1000  (Getting 1000 from each shard)
> >> > > >8. *Rows: *1000
> >> > > >9. *Vector Field Schema:  * >> > > >class="solr.DenseVectorField" hnswMaxConnections="20"
> >> > > > knnAlgorithm="hnsw"
> >> > > >vectorDimension="384" similarityFuncti

[ANNOUNCE] Apache Solr 9.6.1 released

2024-05-29 Thread Houston Putman
The Solr PMC is pleased to announce the release of Apache Solr 9.6.1.

Solr is the popular, blazing fast, open source NoSQL search platform from
the Apache Solr project. Its major features include powerful full-text
search, hit highlighting, faceted search, dynamic clustering, database
integration, rich document handling, and geospatial search. Solr is highly
scalable, providing fault tolerant distributed search and indexing, and
powers the search and navigation features of many of the world's largest
internet sites.

Solr 9.6.1 is available for immediate download at:

  

### Solr 9.6.1 Release Highlights:

* Core loading at startup is no longer capped at 60 seconds
* Replicas are ensured to be marked as down when a Solr node is started
* 'Illegal character in query' exception in the new HttpJdkSolrClient has
been fixed
* Performance regression for aliases in SolrJ has been fixed via partially
reverting a recent change
* Fixed debugging of Rerank Queries when reRankScale is used
* System file separator is now used in CachingDirectoryFactory, instead of
'/', fixing a regression on Windows

Please refer to the Upgrade Notes in the Solr Ref Guide for information on
upgrading from previous Solr versions:

  <https://solr.apache.org/guide/solr/9_6/upgrade-notes/solr-upgrade-notes.html>

Please read CHANGES.txt for a full list of bugfixes:

  

Thanks to all contributors!

hossman, Houston Putman, Jan Høydahl, Andy Webb, Christine Poerschke,
Aparna Suresh, David Smiley, Vincent Primault


Solr query on multivalued field , removing duplicate results and getting distinct results

2024-05-29 Thread Natarajan, Rajeswari
Hi ,

I'm looking to get distinct results from a query on a multivalued field. Field 
collapsing will not work on a multivalued field, and I'm not inclined to use 
faceting. Is there any other way this can be achieved?


Thanks,
Rajeswari


Re: Solr query on multivalued field , removing duplicate results and getting distinct results

2024-05-29 Thread Mikhail Khludnev
For some of these cases JSON Facets were an answer. However, topdocs
https://issues.apache.org/jira/browse/SOLR-7830 isn't available.
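
For reference, a JSON Facet request that lists the distinct values of a multivalued field might be built as in the Python sketch below. It is only an illustration of the JSON Facets route: it returns distinct field values and a count, not de-duplicated documents, and the field name "tags" and the function name are placeholders.

```python
import json


def distinct_values_request(query: str, field: str, limit: int = 100) -> dict:
    """Build a JSON Request API body that facets on one field."""
    return {
        "query": query,
        "limit": 0,  # no documents, only the facet results
        "facet": {
            # terms facet: one bucket per distinct value of the field
            "distinct_values": {"type": "terms", "field": field, "limit": limit},
            # aggregation: number of distinct values
            "distinct_count": f"unique({field})",
        },
    }


if __name__ == "__main__":
    # Hypothetical multivalued field name.
    print(json.dumps(distinct_values_request("*:*", "tags"), indent=2))
```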



-- 
Sincerely yours
Mikhail Khludnev