Hello dear Solr Team
In my documents I use fields with the field type text_de. Lately I came across
some weird behavior with this field type. To reproduce it, I set up a local
Solr server on my machine, where I could observe the same behavior.
To reproduce it yourself, just set up a local Solr server as described here:
https://solr.apache.org/guide/solr/latest/deployment-guide/installing-solr.html
Start the Solr server: bin/solr start
Create a simple core: bin/solr create -c dario
Now in the browser go to http://localhost:8983/solr/#/dario/schema and add two
fields.
Field 1:
Use name_general for name and text_general for the field type. Leave everything
else as default.
Field 2:
Use name_de for name and text_de for the field type. Leave everything else as
default.
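If you prefer to script this step instead of clicking through the Admin UI, the
two fields could probably also be added through the Schema API. This is only a
sketch: it assumes the core URL from above and that the field type defaults
match what the UI form sets. The helper name add_field_request is my own.

```python
import json
import urllib.request

SOLR_CORE = "http://localhost:8983/solr/dario"  # local core created above


def add_field_request(name, field_type):
    """Build (but do not send) a Schema API request that adds one field."""
    body = json.dumps({"add-field": {"name": name, "type": field_type}})
    return urllib.request.Request(
        f"{SOLR_CORE}/schema",
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


requests_to_send = [
    add_field_request("name_general", "text_general"),
    add_field_request("name_de", "text_de"),
]

# To apply them against a running server, uncomment:
# for req in requests_to_send:
#     print(urllib.request.urlopen(req).read().decode("utf-8"))
```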
Then go to http://localhost:8983/solr/#/dario/documents to index some documents.
Select CSV for Document Type.
In the text field Document(s), enter the following text (the hyphens mark the
start and end of the text and should not be pasted):
-
name_general,name_de
DARIO,DARIO
DARIOT,DARIOT
DARIOTE,DARIOTE
DARIOTEN,DARIOTEN
DARIOTENU,DARIOTENU
-
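The same five documents could also be sent to the update handler from a script.
This is a sketch under the assumption that the core accepts CSV on /update just
as the Documents screen does; the helper name csv_update_request is my own.

```python
import urllib.request

# The exact CSV from above: one header row plus five documents.
CSV_DOCS = "\n".join([
    "name_general,name_de",
    "DARIO,DARIO",
    "DARIOT,DARIOT",
    "DARIOTE,DARIOTE",
    "DARIOTEN,DARIOTEN",
    "DARIOTENU,DARIOTENU",
]) + "\n"


def csv_update_request(csv_text, core_url="http://localhost:8983/solr/dario"):
    """Build (but do not send) a CSV update request that commits immediately."""
    return urllib.request.Request(
        f"{core_url}/update?commit=true",
        data=csv_text.encode("utf-8"),
        headers={"Content-Type": "text/csv; charset=utf-8"},
        method="POST",
    )


request = csv_update_request(CSV_DOCS)

# To index against a running server, uncomment:
# print(urllib.request.urlopen(request).read().decode("utf-8"))
```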
Now go to http://localhost:8983/solr/#/dario/query to search documents.
First, I use the query *:* to find all documents. This works as expected.
Then I use the query name_general:* to find all documents with the field
name_general. This again finds all documents, as expected. The same goes for
name_de:*.
Now I want to find specific documents with the text_general field:
Searching for name_general:DARIO finds one document where name_general is
exactly DARIO. This is what I expected. The same thing happens for all other
names.
Now it gets interesting. Searching the text_de field name_de leads to all kinds
of weird results. Listed below are the queries, the documents each query finds,
and whether I expected each result:
-
name_de:DARIO -> DARIO (expected)
name_de:DARIOT -> DARIOT (expected), DARIOTE (not expected), DARIOTEN (not expected)
name_de:DARIOTE -> DARIOT (not expected), DARIOTE (expected), DARIOTEN (not expected)
name_de:DARIOTEN -> DARIOT (not expected), DARIOTE (not expected), DARIOTEN (expected)
name_de:DARIOTENU -> DARIOTENU (expected)
-
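For anyone who wants to rerun these five queries without the Admin UI, here is a
small sketch that builds the corresponding request URLs for the select handler.
The fl and rows parameters are my own choice to keep the output short.

```python
from urllib.parse import urlencode

NAMES = ["DARIO", "DARIOT", "DARIOTE", "DARIOTEN", "DARIOTENU"]


def select_url(query, core_url="http://localhost:8983/solr/dario"):
    """Return the /select URL for one query, showing only the name_de field."""
    params = urlencode({"q": query, "fl": "name_de", "rows": 10})
    return f"{core_url}/select?{params}"


urls = [select_url(f"name_de:{name}") for name in NAMES]

for url in urls:
    print(url)  # open in a browser or fetch with urllib.request.urlopen
```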
This probably has something to do with stemming. But why should DARIOTEN be
stemmed to DARIOT? (Sorry for the bad English; as you can guess from my use of
text_de, my first language is German.)
DARIOTEN is also not defined in the stop words file or anywhere else. So what
is happening here?
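To make my suspicion concrete, here is a toy model of what I imagine could be
going on. It is NOT Solr's actual stemmer. I simply assume that both the indexed
value and the query term are lowercased and that a trailing -en or -e is stripped
from longer words. The function names and length thresholds are my own guesses.

```python
NAMES = ["DARIO", "DARIOT", "DARIOTE", "DARIOTEN", "DARIOTENU"]


def toy_light_stem(term):
    """Toy rule (an assumption): lowercase, then strip a trailing -en or -e."""
    t = term.lower()
    if len(t) > 4 and t.endswith("en"):
        return t[:-2]
    if len(t) > 3 and t.endswith("e"):
        return t[:-1]
    return t


# Hypothetical index: every document value is stored in its stemmed form.
INDEXED = {name: toy_light_stem(name) for name in NAMES}


def matches(query_term):
    """Return the documents whose stemmed value equals the stemmed query term."""
    stemmed_query = toy_light_stem(query_term)
    return [name for name in NAMES if INDEXED[name] == stemmed_query]


for name in NAMES:
    print(f"name_de:{name} -> {matches(name)}")
```

With these made-up rules DARIOT, DARIOTE and DARIOTEN all collapse to the same
term, which would reproduce the matches listed above. I do not know whether
text_de really works like this.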
I also tried enclosing the search term in quotes, e.g. name_de:"DARIO".
This did not change the results.
So this is one problem that I cannot explain, but would very much like to
understand. But this is not my only confusion with this field type. What
follows is another problem.
Sometimes I want to find documents that contain some text (as opposed to exact
matches). So I enclose the name in asterisks. I expect these queries to find all
the documents that EITHER exactly match the word OR contain the word.
Again, this works as expected with the name_general field. But again, the
name_de field behaves weirdly.
I will again list all the queries and the results. This time I will also note
the documents that were missing from the results, but I expected them to be
included. I will note these documents after the pipe |.
As this time no documents are found that should not be found, the expected /
not expected marker is omitted.
-
name_de:*DARIO* -> DARIO, DARIOT, DARIOTE, DARIOTEN, DARIOTENU
name_de:*DARIOT* -> DARIOT, DARIOTE, DARIOTEN, DARIOTENU
name_de:*DARIOTE* -> DARIOTENU | DARIOTE, DARIOTEN
name_de:*DARIOTEN* -> DARIOTENU | DARIOTEN
name_de:*DARIOTENU* -> DARIOTENU
-
As you can see, for name_de:*DARIOTE* and name_de:*DARIOTEN* not all documents
are found that I expected to be found.
What are the mechanisms leading to this behavior?
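My best guess can be written down as a small model. It rests on two assumptions
that I cannot confirm: first, that the values are indexed with the endings -e and
-en stripped (so DARIOTE and DARIOTEN are both stored as "dariot"), and second,
that the text between the asterisks is only lowercased, not stemmed, before it
is compared with the indexed terms.

```python
# Hypothetical indexed terms, assuming -e / -en were stripped at index time.
INDEXED_TERMS = {
    "DARIO": "dario",
    "DARIOT": "dariot",
    "DARIOTE": "dariot",
    "DARIOTEN": "dariot",
    "DARIOTENU": "dariotenu",
}


def wildcard_contains(text):
    """Model of name_de:*text*: lowercase the text, then substring-match it
    against the (assumed) indexed terms without stemming the query."""
    needle = text.lower()
    return [doc for doc, term in INDEXED_TERMS.items() if needle in term]


for text in INDEXED_TERMS:
    print(f"name_de:*{text}* -> {wildcard_contains(text)}")
```

Under these assumptions *DARIOTE* and *DARIOTEN* only hit DARIOTENU, because no
other indexed term still contains "dariote". That would match the list above,
but I would like to know whether this is what really happens.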
I also tried all queries using regular expressions, e.g. name_de:/.*DARIO.*/
This led to the same results as before.
I look forward to an easy to understand explanation for why text_de behaves the
way it does.
With kind regards
Dario Viva