Re: standard tokenizer seemingly splitting on dot

2023-05-04 Thread Bill Tantzen
Rahul, no, I do not, but note that this behavior has been observed by others and reported as a possible issue. Thank you!

~~Bill

Re: standard tokenizer seemingly splitting on dot

2023-05-04 Thread Rahul Goswami
Bill, do you have a WordDelimiterFilterFactory in the analysis chain (with the "preserveOriginal" attribute likely set to 0)? That would split the token on the period downstream in the analysis chain even if StandardTokenizer doesn't.

-Rahul
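For reference, a chain along these lines (a hypothetical sketch, not Bill's actual schema; Solr 9 ships the graph variant of the filter) would reproduce the split even when StandardTokenizer keeps the dot, and preserveOriginal="1" would keep the unsplit token alongside the parts:

```xml
<!-- Hypothetical field type, for illustration only -->
<fieldType name="text_wdf_example" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <!-- preserveOriginal="0" (the default) discards "ab1.tif" once it is
         split into "ab1" and "tif"; preserveOriginal="1" keeps it too -->
    <filter class="solr.WordDelimiterGraphFilterFactory"
            generateWordParts="1" generateNumberParts="1"
            preserveOriginal="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

The Analysis screen in the Solr admin UI shows the token stream after each stage, which is the quickest way to check whether such a filter is present in the chain.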

Re: standard tokenizer seemingly splitting on dot

2023-05-04 Thread Mikhail Khludnev
Raised https://github.com/apache/lucene/issues/12264. Let's see what the devs say.

Re: standard tokenizer seemingly splitting on dot

2023-05-03 Thread Bill Tantzen
Shawn, no, email addresses are not preserved -- from the docs:

- The "@" character is among the set of token-splitting punctuation, so email addresses are not preserved as single tokens.

But the non-split on "test.com" vs the split on "test7.com" is unexpected!

~~Bill

Re: standard tokenizer seemingly splitting on dot

2023-05-03 Thread Shawn Heisey
On 5/2/23 15:30, Bill Tantzen wrote:
> This works as I expected: ab00c.tif -- tokenizes as it should with a value of ab00c.tif. This doesn't work as I expected: ab003.tif -- tokenizes with a result of ab003 and tif.

I got the same behavior with ICUTokenizer, which uses ICU4J for Unicode handling.

Re: standard tokenizer seemingly splitting on dot

2023-05-02 Thread Gus Heck
That looks like a bug. It seems to be splitting if the character classes before and after the dot differ, but not if they are the same:

  ST:  XYZ123 tif
  SF:  XYZ123 tif
  LCF: xyz123 tif

and

  ST:  XYZ 123tif
  SF:  XYZ 123tif
  LCF: xyz 123tif

But...

  ST:  XYZ123.123tif
  SF:  XYZ123.123tif
  LCF: xyz123.123tif
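Gus's character-class observation matches the UAX#29 word-break rules that StandardTokenizer implements: "." is in the MidNumLet class, which holds a token together only between letter.letter (WB6/WB7) and digit.digit (WB11/WB12); a digit.letter or letter.digit boundary breaks. A rough, ASCII-only approximation of that one rule (this is an illustrative sketch, not Lucene's actual code):

```python
import re

def wordbreak_dot(text):
    """Approximate how StandardTokenizer treats '.' (UAX#29 MidNumLet):
    the dot joins letter.letter and digit.digit, but the token breaks
    (and the dot is discarded) when the classes on either side differ."""
    out = []
    # First split on everything that is neither alphanumeric nor a dot.
    for tok in re.split(r'[^A-Za-z0-9.]+', text):
        if not tok:
            continue
        parts, buf = [], ''
        for i, ch in enumerate(tok):
            if ch == '.':
                prev = tok[i - 1] if i > 0 else ''
                nxt = tok[i + 1] if i + 1 < len(tok) else ''
                same_class = (prev.isalpha() and nxt.isalpha()) or \
                             (prev.isdigit() and nxt.isdigit())
                if same_class:
                    buf += ch          # dot kept inside the token
                else:
                    if buf:            # class change: break, drop the dot
                        parts.append(buf)
                    buf = ''
            else:
                buf += ch
        if buf:
            parts.append(buf)
        out.extend(parts)
    return out
```

Under this rule, wordbreak_dot("ab00c.tif") returns ["ab00c.tif"] while wordbreak_dot("ab003.tif") returns ["ab003", "tif"], and likewise "test.com" stays whole while "test7.com" splits, which reproduces every example reported in the thread.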

Re: standard tokenizer seemingly splitting on dot

2023-05-02 Thread Bill Tantzen
OK, I see what's going on. I should not have used a generic example like XYZ. In my specific case, as you can see, I'm working with filenames.

This works as I expected: ab00c.tif -- tokenizes as it should with a value of ab00c.tif.
This doesn't work as I expected: ab003.tif -- tokenizes with a result of ab003 and tif.

Re: standard tokenizer seemingly splitting on dot

2023-05-02 Thread Shawn Heisey
On 5/2/23 13:16, Bill Tantzen wrote:
> This tokenizer splits the text field into tokens, treating whitespace and punctuation as delimiters. Delimiter characters are discarded, with the following exceptions:
> - Periods (dots) that are not followed by whitespace are kept as part of the token, including Internet domain names.

Re: standard tokenizer seemingly splitting on dot

2023-05-02 Thread Gus Heck
I concur that the docs clearly state your expected behavior should be true:

Standard Tokenizer

This tokenizer splits the text field into tokens, treating whitespace and punctuation as delimiters. Delimiter characters are discarded, with the following exceptions:

- Periods (dots) that are not followed by whitespace are kept as part of the token, including Internet domain names.

Re: standard tokenizer seemingly splitting on dot

2023-05-02 Thread Bill Tantzen
Thanks Dave! Using a string field instead would work fine for my purposes, I think... I'm just trying to understand why it doesn't work with a field of type text_general, which uses the standard tokenizer in both the index and the query analyzer. The docs state:

This tokenizer splits the text field into tokens, treating whitespace and punctuation as delimiters. [...]

Re: standard tokenizer seemingly splitting on dot

2023-05-02 Thread Dave
You’re not doing anything wrong: a dot is not a word character, so it splits the field in both the index and the query. If you used a string field instead, it theoretically would preserve the non-word characters, but it would also lead to stricter search constraints. If you try this, you will need to re-index a couple of documents [...]
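A string-typed field along the lines Dave describes could look like this (the field and copyField names here are hypothetical, for illustration only); a string field stores the value verbatim, so only an exact match on the whole filename would hit:

```xml
<!-- Hypothetical: copy the metadata into an untokenized string field -->
<dynamicField name="*_str" type="string" indexed="true" stored="true"/>
<copyField source="metadata_txt" dest="metadata_str"/>
```

As Dave notes, existing documents would have to be re-indexed before the new field is populated.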

Re: standard tokenizer seemingly splitting on dot

2023-05-02 Thread Bill Tantzen
I'm using the solrconfig.xml from the distribution, ./server/solr/configsets/_default/conf/solrconfig.xml.

But this problem extends to the index as well; using the initial example, if I search for metadata_txt:ab1 (instead of ab1.tif), my result set includes ab1.tif, ab1.jpg, [...]

Re: standard tokenizer seemingly splitting on dot

2023-05-02 Thread Mikhail Khludnev
The analyzer is configured in schema.xml. But literally, splitting on a dot is what I expect from StandardTokenizer.

Re: standard tokenizer seemingly splitting on dot

2023-05-02 Thread Bill Tantzen
Mikhail, thanks for the quick reply. Here is the parser info:

LuceneQParser

~~Bill

Re: standard tokenizer seemingly splitting on dot

2023-05-02 Thread Mikhail Khludnev
Hello Bill,
Which analyzer is configured for metadata_txt? Perhaps you need to tune it accordingly.

standard tokenizer seemingly splitting on dot

2023-05-02 Thread Bill Tantzen
In my Solr 9.2 schema, I am leveraging the dynamicField which tokenizes with solr.StandardTokenizerFactory for index and query. However, when I query with, for example, metadata_txt:XYZ.tif, I see many more hits than I expect. When I add debug=true to the query, I see: metadata_txt:XYZ.tif [...]
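For reference, the stock *_txt rule in the _default configset looks roughly like this (quoted from memory, so verify against your own managed-schema):

```xml
<dynamicField name="*_txt" type="text_general" indexed="true" stored="true"/>
```

The text_general field type it points at uses solr.StandardTokenizerFactory in both its index and query analyzers, which is why the same split happens on both sides.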