Rahul,
No, I do not, but note that this behavior has been observed by others and
reported as a possible issue.
Thank you!
~~Bill
On Thu, May 4, 2023 at 1:07 PM Rahul Goswami wrote:
> Bill,
> Do you have a WordDelimiterFilterFactory in the analysis chain (with
> "*preserveOriginal*" attribute likely set to *0*)?
Bill,
Do you have a WordDelimiterFilterFactory in the analysis chain (with
"*preserveOriginal"
*attribute likely set to *0*)?
That would split the token on the period downstream in the analysis chain
even if StandardTokenizer doesn't.
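For illustration, a chain along these lines would produce that kind of
split (a hypothetical field type using the graph variant of the filter;
the names and settings are only an example, not what I assume you have):

<fieldType name="text_filenames" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <!-- Hypothetical filter: even if the tokenizer emits "ab003.tif" as a
         single token, the period is treated as a delimiter here, so the
         filter splits it into "ab003" and "tif". With preserveOriginal="0"
         the unsplit original token is discarded. -->
    <filter class="solr.WordDelimiterGraphFilterFactory"
            generateWordParts="1" generateNumberParts="1"
            splitOnNumerics="0" preserveOriginal="0"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

Setting preserveOriginal="1" instead would at least keep "ab003.tif"
alongside the split parts.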
-Rahul
On Thu, May 4, 2023 at 6:22 AM Mikhail Khludnev wrote:
Raised https://github.com/apache/lucene/issues/12264.
Let's see what the devs say.
On Wed, May 3, 2023 at 6:13 PM Bill Tantzen
wrote:
> Shawn,
> No, email addresses are not preserved -- from the docs:
>
> - The "@" character is among the set of token-splitting punctuation, so
> email addresses are not preserved as single tokens.
Shawn,
No, email addresses are not preserved -- from the docs:
- The "@" character is among the set of token-splitting punctuation, so
  email addresses are not preserved as single tokens.
but the non-split on "test.com" vs the split on "test7.com" is unexpected!
~~Bill
On 5/2/23 15:30, Bill Tantzen wrote:
This works as I expected:
ab00c.tif -- tokenizes as it should with a value of ab00c.tif
This doesn't work as I expected
ab003.tif -- tokenizes with a result of ab003 and tif
I got the same behavior with ICUTokenizer, which uses ICU4J for Unicode
handling.
That looks like a bug. It seems to split when the character classes before
and after the period differ, but not when they are the same.
(ST = StandardTokenizer, SF = StopFilter, LCF = LowerCaseFilter in the
Analysis screen output)

ST:  XYZ123 | tif
SF:  XYZ123 | tif
LCF: xyz123 | tif

and

ST:  XYZ | 123tif
SF:  XYZ | 123tif
LCF: xyz | 123tif

But...

ST:  XYZ123.123tif
SF:  XYZ123.123tif
LCF: xyz123.123tif
OK, I see what's going on. I should not have used a generic example like
XYZ.
In my specific case, as you can see, I'm working with filenames.
This works as I expected:
ab00c.tif -- tokenizes as it should with a value of ab00c.tif
This doesn't work as I expected
ab003.tif -- tokenizes with a result of ab003 and tif
On 5/2/23 13:16, Bill Tantzen wrote:
This tokenizer splits the text field into tokens, treating whitespace and
punctuation as delimiters.
Delimiter characters are discarded, with the following exceptions:
Periods (dots) that are not followed by whitespace are kept as part of the
token, including Internet domain names.
I concur that the docs clearly state your expected behavior should be true:
Standard Tokenizer
This tokenizer splits the text field into tokens, treating whitespace and
punctuation as delimiters. Delimiter characters are discarded, with the
following exceptions:
- Periods (dots) that are not followed by whitespace are kept as part of
  the token, including Internet domain names.
Thanks Dave!
Using a string field instead would work fine for my purposes I think...
I'm just trying to understand why it doesn't work with a field of type
text_general which uses the standard tokenizer in both the index and the
query analyzer. The docs state:
This tokenizer splits the text field into tokens, treating whitespace and
punctuation as delimiters.
You’re not doing anything wrong; a dot is not a word character, so it splits
the field in both the index and the query. If you used a string field instead,
it would in theory keep the non-word characters, but it would also lead to
stricter search constraints. If you try this you need to re-index a couple of
documents.
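Something along these lines is what I have in mind (the field names here
are just made-up examples):

<!-- Hypothetical: keep an exact, untokenized copy of the value alongside
     the tokenized one. A "string" field is not analyzed, so "ab003.tif"
     stays a single exact value. -->
<dynamicField name="*_txt_exact" type="string" indexed="true" stored="true"/>
<copyField source="metadata_txt" dest="metadata_txt_exact"/>

Queries against metadata_txt_exact would then have to match the whole value
exactly, and the copy only exists for documents indexed after the change,
hence the re-indexing.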
I'm using the solrconfig.xml from the distribution,
./server/solr/configsets/_default/conf/solrconfig.xml
But this problem extends to the index as well; using the initial example,
if I search for metadata_txt:ab1 (instead
of ab1.tif), my result set includes ab1.tif, ab1.jpg,
ab
The analyzer is configured in schema.xml. But splitting on the dot is
literally what I expect from StandardTokenizer.
On Tue, May 2, 2023 at 8:48 PM Bill Tantzen
wrote:
> Mikhail,
> Thanks for the quick reply. Here is the parser info:
>
> LuceneQParser
>
> ~~Bill
>
> On Tue, May 2, 2023 at 12:43 PM Mikhail Khludnev wrote:
Mikhail,
Thanks for the quick reply. Here is the parser info:
LuceneQParser
~~Bill
On Tue, May 2, 2023 at 12:43 PM Mikhail Khludnev wrote:
> Hello Bill,
> Which analyzer is configured for metadata_txt? Perhaps you need to tune it
> accordingly.
>
> On Tue, May 2, 2023 at 7:40 PM Bill Tantzen wrote:
Hello Bill,
Which analyzer is configured for metadata_txt? Perhaps you need to tune it
accordingly.
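For example, something like this (just an illustration, not the only way)
would keep a whole filename as a single token:

<fieldType name="filename_keyword" class="solr.TextField">
  <analyzer>
    <!-- KeywordTokenizer emits the entire field value as one token;
         only lower-casing is applied, so "AB003.TIF" matches "ab003.tif". -->
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>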
On Tue, May 2, 2023 at 7:40 PM Bill Tantzen
wrote:
> In my solr 9.2 schema, I am leveraging the dynamicField
>
> stored="true"/>
>
> which tokenizes with solr.StandardTokenizerFactory for index and query.
In my solr 9.2 schema, I am leveraging the dynamicField
which tokenizes with solr.StandardTokenizerFactory for index and query.
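For reference, the relevant definitions in the stock _default configset look
roughly like this (paraphrasing from memory, so double-check against your
own managed-schema):

<dynamicField name="*_txt" type="text_general" indexed="true" stored="true"/>

<fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.SynonymGraphFilterFactory" synonyms="synonyms.txt"
            ignoreCase="true" expand="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>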
However, when I query with, for example,
metadata_txt:XYZ.tif
I see many more hits than I expect. When I add debug=true to the query, I
see:
metadata_txt:XYZ.tif
met