Honestly, there is a missing feature here. Solr should have a free text query 
parser. Run the query through the standard tokenizer, ignore all the syntax, 
and make a bunch of word/phrase queries.
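Lacking such a parser in Solr itself, the same effect can be approximated client-side before the query is sent. A minimal sketch in Python (the regex below is an assumption listing the classic Lucene/Solr query-syntax characters; it is not part of any Solr API):

```python
import re

# Characters with special meaning to the Lucene/Solr classic query parser,
# plus the && and || boolean operators. Assumed list, not exhaustive for
# every parser variant.
LUCENE_SPECIALS = re.compile(r'&&|\|\||[+\-!(){}\[\]^"~*?:\\/]')

def free_text_query(raw: str, max_words: int = 40) -> str:
    """Reduce arbitrary user input to a plain bag-of-words query:
    replace all query syntax with spaces, split on whitespace, and
    cap the number of terms."""
    cleaned = LUCENE_SPECIALS.sub(' ', raw)
    words = cleaned.split()[:max_words]
    return ' '.join(words)
```

This deliberately throws away operators (AND/OR survive only as plain words), which is exactly the point: nothing the user types can reach the parser as syntax.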

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)

> On May 29, 2024, at 10:25 AM, Dmitri Maziuk <dmitri.maz...@gmail.com> wrote:
> 
> On 5/29/24 11:43, Walter Underwood wrote:
>> I’ve done three kinds of sanity checks/fixes to avoid performance problems.
>> 1. Prevent deep paging. Have to do this every time. When a request comes in 
>> for a page past 50, it gets rewritten to the 50th page.
>> 2. Limit the size of queries. With homework help, we had people pasting in 
>> 800 word queries. Those get trimmed to 40 words. The results for 40 words 
>> were nearly the same as those for 80 words in a test over a few thousand 
>> real user queries. Google only does 32.
>> 3. Remove all syntax characters (or replace them with spaces). This gets 
>> tricky, because things like “-” are OK inside a word. A more conservative 
>> approach is to remove “*” and “?”, so you prevent script kiddie queries like 
>> “a* b* c* d* e* f* …”
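The three checks above fit naturally into one request-rewriting step before the query hits Solr. A minimal sketch, assuming a default of 10 rows per page and taking the conservative option for check 3 (strip only "*" and "?"); the function name and limits are illustrative, not from any library:

```python
import re

MAX_PAGE = 50                     # check 1: cap deep paging at page 50
MAX_WORDS = 40                    # check 2: trim very long queries
WILDCARDS = re.compile(r'[*?]')   # check 3: conservative syntax stripping

def sanitize_request(q: str, page: int, rows: int = 10):
    """Apply the three sanity checks to a search request; returns the
    cleaned query string and the Solr 'start' offset."""
    page = min(page, MAX_PAGE)                          # rewrite deep pages
    q = ' '.join(WILDCARDS.sub(' ', q).split()[:MAX_WORDS])
    start = (page - 1) * rows                           # offset for Solr
    return q, start
```

For example, a request for page 200 of "a* b* c*" would be rewritten to the query "a b c" at the offset for page 50.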
> 
> Thanks, everyone.
> 
> For #3 I think I'll steal the regexes from solarium, as Thomas suggested. #1 
> and #2 aren't our problem ATM but are worth adding while I'm at it.
> 
> I have doubts about reconfiguring the logging as per Misha's suggestion: 
> it'll save some disk space, but the exceptions themselves will still be 
> thrown with all their overhead... and disk is the cheapest part of it all.
> 
> And yeah, we are using the standard parser. It may be worth switching to e.g. 
> edismax, but that comes with lots of regression testing (and finding all the 
> places to test first), making it a much bigger project.
> 
> Thanks again,
> Dima
> 
