RE: Can an analyzer access other field's data during index time?

2023-05-03 Thread Wang, Guan
be used to yield indexed and stored (Lucene) fields with different content. If your logic is that comprehensive, you may also consider extracting the analysis logic completely: https://solr.apache.org/guide/solr/latest/indexing-guide/external-files-processes.html#the-preanalyzedfield-type On Tue, Apr

RE: Can an analyzer access other field's data during index time?

2023-04-25 Thread Wang, Guan
time? Guan, I hardly grasp the particular obstacle. But I don't think that the task is out of reach overall. Can you share a test case formally describing the desired behavior? On Tue, Apr 25, 2023 at 12:29 AM Wang, Guan wrote: > Hi Mikhail, > > Thank

RE: Can an analyzer access other field's data during index time?

2023-04-24 Thread Wang, Guan
2023 at 11:40 PM Wang, Guan wrote: > Hi Mikhail, > > Thank you for the definitive answer! > > I could "solve" this by adding a header to the document with the proper > information to guide the indexing process. The header will be parsed and then > ignored by the tokenizer. However

RE: Can an analyzer access other field's data during index time?

2023-04-24 Thread Wang, Guan
he existing codebase where the Field has no reference to the enclosing Document. sigh. On Mon, Apr 24, 2023 at 6:00 PM Wang, Guan wrote: > Hi, > > I understand that Lucene analyzers work on a per-field basis. But I wonder if it's > even possible for an analyzer on field A to be able to access da

Can an analyzer access other field's data during index time?

2023-04-24 Thread Wang, Guan
Hi, I understand that Lucene analyzers work on a per-field basis. But I wonder if it's even possible for an analyzer on field A to access data in field B at any stage of the indexing process, say in a CharFilter, Tokenizer, or TokenFilter? I'd like to control the behavior of the indexing process fo

RE: Integrating NLP into Lucene Analysis Chain

2022-11-21 Thread Wang, Guan
Hi Luke, Regarding what you've described as a "bug" in NLPPOSTaggerOp, I agree that there could be a more elegant solution than simply synchronizing the entire method. That being said, IMHO, I don't see a thread-safety issue. Lucene TokenFilters are not supposed to be shared am

Buffer size for SegmentingTokenizerBase

2022-03-18 Thread Wang, Guan
Hi, Could someone explain to me why the class SegmentingTokenizerBase uses a buffer of only 1024 characters? A comment in the source code mentions that a token may be truncated if no safe stopping index can be found for the existing chars in the buffer. It doesn't sound r