Re: Issue with Japanese User Dictionary

2022-01-13 Thread Tomoko Uchida
Hi,
> The only thing that seems to differ is that the characters are full-width vs half-width, so I was wondering if this is intended behavior or a bug/too restrictive
This is intended behavior. The first column in the user dictionary must be equal to the concatenated string of the second col
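To make the rule concrete, here is a minimal sketch of the consistency check described above, not the actual Lucene code. It assumes the usual Kuromoji user dictionary layout (surface form, space-separated segmentation, readings, part-of-speech) and a naive comma split; `check_entry` and the second sample entry are hypothetical names and data made up for illustration.

```python
import unicodedata


def check_entry(line: str, normalize: bool = False) -> bool:
    """Return True if column 1 equals the concatenation of column 2's segments.

    Assumes a simple comma-separated line with no quoted fields (a simplification).
    """
    fields = line.split(",")
    surface, segmentation = fields[0], fields[1]
    if normalize:
        # NFKC folds full-width Latin letters to their half-width forms.
        surface = unicodedata.normalize("NFKC", surface)
        segmentation = unicodedata.normalize("NFKC", segmentation)
    # Drop the spaces between segments and compare with the surface form.
    return surface == "".join(segmentation.split())


consistent = "関西国際空港,関西 国際 空港,カンサイ コクサイ クウコウ,カスタム名詞"
# Hypothetical entry: full-width "ＡＢＣ" in column 1, half-width "ABC" in column 2.
mixed_width = "ＡＢＣ工業,ABC 工業,エービーシー コウギョウ,カスタム名詞"

print(check_entry(consistent))                   # True
print(check_entry(mixed_width))                  # False
print(check_entry(mixed_width, normalize=True))  # True
```

Running a check like this over a dictionary file before loading it can point at the offending line. Whether normalizing widths is acceptable depends on the rest of the analysis chain.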

Re: Issue with Japanese User Dictionary

2022-01-13 Thread Marc D'Mello
Hi Mike, Thanks for the response! I'm actually not super familiar with UserDictionaries, but looking at the code, it seems like a single line in the user-provided user dictionary corresponds to a single entry? In that case, here is the line (or entry) that does have both widths that I believe is c

Re: Issue with Japanese User Dictionary

2022-01-13 Thread Michael Sokolov
Hi Marc, I wonder if there is a workaround for this issue: e.g., could we have entries for both widths? I wonder if there is some interaction with an analysis chain that is doing half-width -> full-width conversion (or vice versa)? I think the UserDictionary has to operate on pre-analyzed tokens ...

Re: Moving from lucene 6.x to 8.x

2022-01-13 Thread Michael Sokolov
I think the "broken offsets" refers to offsets of tokens "going backwards". Offsets are attributes of tokens that refer back to their character position in the original indexed text. Going backwards means -- a token with a greater position (in the sequence of tokens, or token graph) should not have a le
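As a rough illustration of "going backwards", the sketch below scans a token list in stream order and reports tokens whose start offset is smaller than that of an earlier token. This is a simplified stand-in and not Lucene's own check; the function name `find_backward_offsets` and the `(term, start, end)` tuple layout are assumptions made for the example.

```python
def find_backward_offsets(tokens):
    """tokens: list of (term, start_offset, end_offset) in token-stream order.

    Returns a list of (index, term, reason) for tokens with suspicious offsets.
    """
    problems = []
    last_start = 0
    for i, (term, start, end) in enumerate(tokens):
        if start < 0 or end < start:
            problems.append((i, term, "invalid span"))
        elif start < last_start:
            problems.append((i, term, "start offset goes backwards"))
        else:
            # Only advance the watermark on well-formed tokens.
            last_start = start
    return problems


# Hypothetical stream: the third token points back to offset 0
# after an earlier token already started at offset 3.
stream = [("wi", 0, 2), ("fi", 3, 5), ("wifi", 0, 5)]
print(find_backward_offsets(stream))
```

A filter that emits a combined token after its parts, as in the hypothetical stream here, is one way such offsets can arise.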

Re: Migration from Lucene 5.5 to 8.11.1

2022-01-13 Thread András Péteri
It looks like Sascha runs IndexUpgrader for all major versions, i.e. 6.6.6, 7.7.3 and 8.11.1. File "segments_91" is written by the 7.7.3 run immediately before the error.
On Wed, Jan 12, 2022 at 3:44 PM Adrien Grand wrote:
> The log says what the problem is: version 8.11.1 cannot read indices c