Re: Searching for special characters in documents

Thomas Corthals Mon, 02 Sep 2024 14:16:50 -0700

Hi Thorsten,

For the sake of completeness, I'll explain why I suggested wt=xml, even
though it turns out not to be relevant in your case.


>  Adding "wt=xml" to the query params above didn't change much...

It won't change the query results at all. I just mentioned it because it
avoids confusion about the actual query Solr gets from your request.

curl -s http://localhost:8983/solr/techproducts/select?q=b\\-d

If you run that in a shell like bash, the shell strips one backslash, so the
query that actually gets sent is b\-d. If you're not familiar with JSON, you
could mistakenly assume it was b\\-d when looking at the raw output, because
JSON escapes the backslash with another backslash, putting one back where the
shell had stripped one.

{
  "responseHeader":{
    "status":0,
    "QTime":6,
    "params":{
      "q":"b\\-d"
    }
  },
  "response":{
    "numFound":0,
    "start":0,
    "numFoundExact":true,
    "docs":[ ]
  }
}

With wt=xml there is no potential for confusion when your query contains
backslashes.

<?xml version="1.0" encoding="UTF-8"?>
<response>

<lst name="responseHeader">
  <int name="status">0</int>
  <int name="QTime">0</int>
  <lst name="params">
    <str name="q">b\-d</str>
    <str name="wt">xml</str>
  </lst>
</lst>
<result name="response" numFound="0" start="0" numFoundExact="true">
</result>
</response>

All of this is just to make sure that you're definitely debugging the
"right" query.
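
Since you're sending the queries from Java, the same layering applies to
Java string literals. Here is a minimal sketch of it using only the JDK.
The class EscapeLayers and its jsonEscape helper are made up for this
illustration (they are not part of Solr or SolrJ), and the escaper only
covers the cases needed here.

```java
public class EscapeLayers {
    // Minimal JSON string escaper for illustration only: handles quotes,
    // backslashes and control characters. Not a full JSON library.
    static String jsonEscape(String s) {
        StringBuilder sb = new StringBuilder("\"");
        for (char c : s.toCharArray()) {
            switch (c) {
                case '"':
                    sb.append("\\\"");
                    break;
                case '\\':
                    sb.append("\\\\");
                    break;
                default:
                    if (c < 0x20) {
                        sb.append(String.format("\\u%04x", (int) c));
                    } else {
                        sb.append(c);
                    }
            }
        }
        return sb.append('"').toString();
    }

    public static void main(String[] args) {
        // Java source literal with an escaped backslash; the string itself
        // holds the four characters b \ - d.
        String q = "b\\-d";
        System.out.println(q.length());     // 4
        System.out.println(q);              // b\-d   (the value that is sent)
        System.out.println(jsonEscape(q));  // "b\\-d" (how JSON displays it)
    }
}
```

So a doubled backslash in Java source, or in a JSON response, still stands
for a single backslash in the query itself.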

> Do you have any idea what might be causing this?

It has nothing to do with escaping special characters. It's the tokenizer.
It splits the content field into separate tokens when indexing and
searching. It considers a hyphen as a separator between tokens just like
whitespace, so both "ab-dc" and "ab dc" are indexed as ["ab", "dc"].
However, "ab_dc" is retained as a single token. And those tokens are what
you're actually matching against. So even if you successfully escape a
hyphen, there is no token in your index that will match that literal hyphen.

Try these three values for q=... to see what that means:

   - ab
   - dc
   - ab_dc
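
The splitting described above can be roughly imitated in a few lines of
Java. To be clear, this is not Lucene's tokenizer, just a regex stand-in
that I'm assuming behaves the same way for these simple inputs: it breaks
on anything that isn't a letter, digit or underscore. The class name
TokenSketch is made up for the example.

```java
import java.util.ArrayList;
import java.util.List;

public class TokenSketch {
    // Rough stand-in for the analysis step: split on every run of characters
    // that are not letters, digits or underscores. NOT the real tokenizer.
    static List<String> tokens(String text) {
        List<String> out = new ArrayList<>();
        for (String t : text.split("[^\\p{L}\\p{N}_]+")) {
            if (!t.isEmpty()) {
                out.add(t);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(tokens("ab-dc"));  // [ab, dc]
        System.out.println(tokens("ab dc"));  // [ab, dc]
        System.out.println(tokens("ab_dc"));  // [ab_dc]
        // The query text goes through the same splitting, so b-d ends up as
        // the tokens b and d, neither of which equals ab or dc.
        System.out.println(tokens("b-d"));    // [b, d]
    }
}
```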

If you based your schema on the schemaless example that comes with Solr,
there should be an untokenized copy of the content field indexed under
content_str. Try adding df=content_str to search in this field. If it
exists, you should get the expected results for your wildcard queries (but
not for those without wildcards, because an untokenized field only matches
on its complete value).
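
As a sketch of what such a request looks like on the wire, the following
builds the select URL with plain JDK classes. The base URL is the one from
my curl example above and the query ab\-* is a made-up wildcard query, so
substitute your own core name and query. DfRequest and selectUrl are
hypothetical names for this example.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

public class DfRequest {
    // Builds a select URL with URL-encoded q and df parameters.
    static String selectUrl(String base, String q, String df) {
        return base + "/select?q=" + URLEncoder.encode(q, StandardCharsets.UTF_8)
                + "&df=" + URLEncoder.encode(df, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Assumed base URL, taken from the curl example earlier in this mail.
        String base = "http://localhost:8983/solr/techproducts";
        // Java literal "ab\\-*" holds ab\-* ; the backslash is sent as %5C.
        System.out.println(selectUrl(base, "ab\\-*", "content_str"));
        // http://localhost:8983/solr/techproducts/select?q=ab%5C-*&df=content_str
    }
}
```

In your SolrJ code, I'd expect another setParam("df", "content_str") call
on the SolrQuery to have the same effect, in the same way you already set
q.op.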

Thomas

On Wed, 28 Aug 2024 at 11:05, Thorsten Heit <th...@gmx.de.invalid> wrote:

> Hi Thomas,
>
> > How exactly are you executing the queries? If you're using shell
> > commands, keep in mind that the shell will apply its own escaping rules
> > to your command parameters first, so if you're not careful a backslash
> > might already be "eaten" before the actual request is fired off. Same
> > goes for any string escaping rules in the programming language you might
> > be implementing your test script in.
>
> I'm executing the queries from within Java by using solr-solrj-9.6.1.
> This is an excerpt of the code in question that is being used:
>
>
> var query = new SolrQuery(searchString)
>      .setParam("q.op", SearchMode.OR == request.getSearchMode() ? "OR" : "AND")
>      .setFields("id", "filename")
>      .setSort("id", ORDER.asc)
>      .setStart(request.getStart())
>      .setRows(request.getRows());
>
> final QueryResponse queryResponse;
> // Acquire the lock before entering try, so that finally only unlocks
> // a lock that is actually held.
> getReadLock().lock();
> try {
>      var solrClient = getSolrClient();
>      queryResponse = solrClient.query(query, METHOD.POST);
> } catch (SolrException ex) {
>      throw new SolrServerException(ex);
> } finally {
>      getReadLock().unlock();
> }
>
>
> I have tested the above snippet with the query strings from my initial
> post, i.e. quoted text, non-quoted text, hyphens escaped/non-escaped
> etc., but the results didn't match what I was expecting...
>
>
> Adding "wt=xml" to the query params above didn't change much...
>
> Do you have any idea what might be causing this?
>
>
> Regards
>
> Thorsten
>
