Re: Searching for special characters in documents

Thomas Corthals Thu, 22 Aug 2024 10:49:51 -0700

Hi Thorsten,

How exactly are you executing the queries? If you're using shell commands,
keep in mind that the shell applies its own escaping rules to your
command parameters first, so if you're not careful a backslash might already
be "eaten" before the actual request is fired off. The same goes for any string
escaping rules in the programming language you might be implementing your
test script in.
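
As a minimal sketch of one way to take the shell out of the picture, a small
Python script can build the request URL itself. The host and port below are
placeholders I made up, and build_select_url is just a hypothetical helper;
only the core name "test" and the parameters come from your mail. The raw
string literal (r"...") keeps Python from interpreting the backslash, and
urlencode percent-encodes it so it reaches Solr intact.

```python
from urllib.parse import urlencode, parse_qs, urlsplit

# Assumption: host and port are placeholders; "test" is the core from the thread.
SOLR_SELECT = "http://localhost:8983/solr/test/select"


def build_select_url(q: str) -> str:
    """Hypothetical helper: build a select URL with the query safely encoded."""
    params = {
        "q": q,
        "q.op": "AND",
        "fl": "id,resourcename",
        "sort": "id asc",
        "echoParams": "explicit",  # echo back what Solr actually received
        "wt": "xml",               # XML does not use backslash as an escape
    }
    return SOLR_SELECT + "?" + urlencode(params)


# Raw string: Python leaves the backslash alone.
url = build_select_url(r"ab\-d*")
print(url)

# Decoding the URL again shows the backslash survived the round trip.
print(parse_qs(urlsplit(url).query)["q"][0])
```

Pasting the printed URL into a browser, or fetching it from the script, avoids
a second round of escaping by the shell.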


With echoParams=explicit you can see what ended up being sent to Solr, but
keep in mind that JSON also uses backslash as an escape character. This is
one case where wt=xml can actually help avoid confusion.
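
To illustrate the JSON point, here is a small sketch using Python's json
module as a stand-in for any JSON serializer. The dictionary only mimics the
shape of the echoed params and is not real Solr output. A query containing one
backslash shows up with two in the JSON text, which is easy to misread as a
doubled escape in the query itself.

```python
import json

# The query as Solr would receive it: exactly one backslash.
q = r"*b\-d*"

# Serialized the way a JSON response would echo it (shape is illustrative only).
echoed = json.dumps({"params": {"q": q}})
print(echoed)  # {"params": {"q": "*b\\-d*"}}

# Parsing the JSON gives back the single-backslash query.
decoded = json.loads(echoed)["params"]["q"]
print(decoded == q)  # True
```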

Thomas


On Wed, 14 Aug 2024 at 15:18, Thorsten Heit <th...@gmx.de.invalid> wrote:

> Hi,
>
> this is the first time I'm writing to this list, so hi to all :-)
>
> I'm having problems querying text having special characters inside (see
>
> https://solr.apache.org/guide/solr/latest/query-guide/standard-query-parser.html#escaping-special-charaters
> ).
>
> My setup:
> Solr 9.6.1 running as a standalone server system under Java 21 on an
> internal Linux VM (Ubuntu 24.04).
>
> For testing purposes I created a new core "test" and uploaded a few
> sample documents to it:
>
>
> {
>    "responseHeader":{
>      "status":0,
>      "QTime":0,
>      "params":{
>        "q":"*:*",
>        "indent":"true",
>        "q.op":"OR",
>        "useParams":"",
>        "_":"1723547755451"
>      }
>    },
>    "response":{
>      "numFound":7,
>      "start":0,
>      "numFoundExact":true,
>      "docs":[{
>        "id":"70",
>        "resourcename":"beispiel.txt",
>        "content_type":["text/plain; charset=windows-1252"],
>        "content":[" \n \n  \n  \n  \n  \n  \n  \n  \n \n
> Suchtext:\r\nab_dc\r\n \n  "],
>        "_version_":1801822834550374400
>      },{
>        "id":"71",
>        "resourcename":"beispiel2.txt",
>        "content_type":["text/plain; charset=windows-1252"],
>        "content":[" \n \n  \n  \n  \n  \n  \n  \n  \n \n
> Suchtext:\r\nab-dc\r\n \n  "],
>        "_version_":1801823062283255808
>      },{
>        "id":"72",
>        "resourcename":"beispiel3.txt",
>        "content_type":["text/plain; charset=windows-1252"],
>        "content":[" \n \n  \n  \n  \n  \n  \n  \n  \n \n  Dies ist ein
> langer Suchtext:\r\nab-de\r\ndef+hi\r\nkl-nop\r\n \n  "],
>        "_version_":1806915982686420992
>      },{
>        "id":"73",
>        "resourcename":"beispiel4.txt",
>        "content_type":["text/plain; charset=windows-1252"],
>        "content":[" \n \n  \n  \n  \n  \n  \n  \n  \n \n
> ab-cd\r\nde-fg\r\n \n  "],
>        "_version_":1806917322172006400
>      },{
>        "id":"74",
>        "resourcename":"beispiel2-1.txt",
>        "content_type":["text/plain; charset=windows-1252"],
>        "content":[" \n \n  \n  \n  \n  \n  \n  \n  \n \n  Dies ist ein
> langer Suchtext:\r\nabedc\r\ndef+ghi\r\n \n  "],
>        "_version_":1807270704395059200
>      },{
>        "id":"75",
>        "resourcename":"beispiel2-2.txt",
>        "content_type":["text/plain; charset=windows-1252"],
>        "content":[" \n \n  \n  \n  \n  \n  \n  \n  \n \n  Dies ist ein
> langer Suchtext:\r\nab-dc\r\ndefxghi\r\n \n  "],
>        "_version_":1807270722296348672
>      },{
>        "id":"76",
>        "resourcename":"beispiel2-3.txt",
>        "content_type":["text/plain; charset=windows-1252"],
>        "content":[" \n \n  \n  \n  \n  \n  \n  \n  \n \n  Dies ist ein
> langer Suchtext:\r\nabedc\r\ndefxghi\r\n \n  "],
>        "_version_":1807270740219658240
>      }]
>    }
> }
>
>
> The problem is that I haven't found out how to correctly search for
> documents containing a "-" by using wildcards (* and ?). Some queries
> seem to work while others don't...
>
> The query itself is basically the same:
>
> q=...&q.op=AND&fl=id,resourcename&sort=id+asc&start=0&rows=2147483647
>
> and differs only in the value of "q".
>
> My queries:
>
> q: *uchtex*
> => ok, 6 documents found (#70, #71, #72, #74, #75, #76)
>
> q: uchtex*
> => ok, 0 documents found
>
> q: Suchtex*
> => ok, 6 documents found (#70, #71, #72, #74, #75, #76)
>
> q: b?d
> => ok, 0 documents found
>
> q: b?d*
> => ok, 0 documents found
>
> q: *b-d*
> => ok, 0 documents found (because "-" isn't quoted, right?)
>
> q: *b?d*
> => not ok, only 3 documents found: #70, #74, #76
> => missing:  #71, #72, #75
>
> q: *b*d*
> => not ok, only 3 documents found: #70, #74, #76
> => (all 7 expected)
>
> q: ?b?d?
> => not ok, only 3 documents found: #70, #74, #76
> => missing:  #71, #72, #75
>
> q: ab*
> => ok, all 7 documents found
>
> q: ab*d
> => not ok, 0 documents found
> => missing: #73
>
> q: ab??d
> => not ok, 0 documents found
> => missing: #73
>
> q: ab\-dc
> => ok, 2 documents found: #71, #75
>
> q: ab\-d*
> => not ok, 0 documents found
> => missing: #71, #72, #75
>
> q: ab?d*
> => not ok, 3 documents found: #70, #74, #76
> => missing: #71, #72, #75
>
> q: *b\-d*
> => not ok, 0 documents found
> => missing: #71, #72, #75
>
> q: *b\\-d*
> => 0
>
>
> Can someone enlighten me as to what I'm doing wrong? Am I missing
> something? Or am I misunderstanding something?
>
>
> Regards
>
> Thorsten
>
>
>
