Honestly, there is a missing feature here. Solr should have a free-text query parser. Run the query through the standard tokenizer, ignore all the syntax, and make a bunch of word/phrase queries.
wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)

> On May 29, 2024, at 10:25 AM, Dmitri Maziuk <dmitri.maz...@gmail.com> wrote:
>
> On 5/29/24 11:43, Walter Underwood wrote:
>> I’ve done three kinds of sanity checks/fixes to avoid performance problems.
>>
>> 1. Prevent deep paging. Have to do this every time. When a request comes in for a page past 50, it gets rewritten to the 50th page.
>>
>> 2. Limit the size of queries. With homework help, we had people pasting in 800-word queries. Those get trimmed to 40 words. In a test with a few thousand real user queries, the results for 40 words were nearly the same as those for 80 words. Google only does 32.
>>
>> 3. Remove all syntax characters (or replace them with spaces). This gets tricky, because things like “-” are OK inside a word. A more conservative approach is to remove “*” and “?”, so you prevent script-kiddie queries like “a* b* c* d* e* f* …”
>
> Thanks, everyone.
>
> For #3 I think I'll steal the regexes from Solarium, as Thomas suggested. #1 & #2 aren't our problem ATM but are worth adding while I'm at it.
>
> I have doubts about reconfiguring the logging as per Misha's suggestion: it'll save some disk space, but the exceptions themselves will still be there with all their overhead... and disk is the cheapest part of it all.
>
> And yeah, we are using the standard parser. It may be worth switching to e.g. edismax, but that comes with lots of regression testing (and finding all the places to test first), making it a much bigger project.
>
> Thanks again,
> Dima
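For anyone wanting to try the three sanity checks Walter describes, here is a minimal sketch in Python. It is illustrative only: the function name `sanitize_request`, the `rows=10` page size, and the exact caps are assumptions, not anything from Walter's actual code, and the syntax stripping here is the conservative variant (removing only `*` and `?`), not the full Solarium regexes.

```python
import re

MAX_PAGE = 50    # check #1: no paging past page 50
MAX_WORDS = 40   # check #2: trim long queries to 40 words

def sanitize_request(query: str, page: int, rows: int = 10):
    """Apply the three sanity checks before sending the query to Solr."""
    # 1. Prevent deep paging: rewrite any page past 50 to the 50th page,
    #    then compute the Solr 'start' offset from the capped page.
    page = min(page, MAX_PAGE)
    start = (page - 1) * rows

    # 2. Trim overly long queries to the first 40 words.
    words = query.split()
    if len(words) > MAX_WORDS:
        query = " ".join(words[:MAX_WORDS])

    # 3. Conservative syntax stripping: drop wildcard characters so
    #    "a* b* c* ..." can't trigger expensive prefix expansions.
    query = re.sub(r"[*?]", "", query)

    return query, start

# Example: a wildcard-heavy request for page 200 gets rewritten.
q, start = sanitize_request("a* b* c*", page=200)
# q == "a b c", start == 490 (page capped to 50)
```

Note that stripping characters can change token boundaries, so a production version would normally replace them with spaces and collapse whitespace rather than delete them outright, as mentioned in the thread.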