nishant94 opened a new issue, #64646:
URL: https://github.com/apache/doris/issues/64646

   ### Search before asking
   
   - [x] I have searched the 
[issues](https://github.com/apache/doris/issues?q=is%3Aissue) and found no 
similar issues.
   
   
   ### Description
   
   The Doris inverted index currently has no Japanese-aware tokenizer. Japanese 
text has no whitespace word boundaries, so the existing 
english/unicode/standard parsers either index whole runs of text as single 
terms or split on individual characters. Both give poor MATCH / MATCH_PHRASE 
recall and precision for Japanese content.
   
   OpenSearch and Elasticsearch solve this with the Lucene kuromoji 
morphological analyzer. Doris already ships a comparable analyzer for Chinese, 
the IK analyzer (`be/src/storage/index/inverted/analyzer/ik/`), but there is no 
equivalent for Japanese.
   
   This enhancement proposes a built-in kuromoji parser, selectable per 
inverted-indexed column via DDL, that segments Japanese text into morphemes at 
index and query time:
   
   ```
     INDEX content_idx (`content`) USING INVERTED
     PROPERTIES("parser" = "kuromoji", "parser_mode" = "search")
   ```
    
   Once indexed, MATCH, MATCH_PHRASE, and TOKENIZE() operate over the segmented 
Japanese terms.
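   
   As a rough sketch of how this might look end to end, the statements below 
create a table with the proposed index and then query it. The `kuromoji` parser 
and its `parser_mode` values are the proposal in this issue and do not exist in 
Doris today. The table name `jp_articles`, its columns, the bucket and replica 
settings, and the sample strings are hypothetical, and the `TOKENIZE` call 
assumes the new parser would be accepted the same way as the existing ones.
   
   ```sql
   -- Hypothetical table; only the INDEX clause comes from this proposal.
   CREATE TABLE jp_articles (
       id BIGINT,
       content STRING,
       INDEX content_idx (`content`) USING INVERTED
           PROPERTIES("parser" = "kuromoji", "parser_mode" = "search")
   )
   DUPLICATE KEY(id)
   DISTRIBUTED BY HASH(id) BUCKETS 1
   PROPERTIES ("replication_num" = "1");

   INSERT INTO jp_articles VALUES (1, '関西国際空港に到着しました');

   -- Inspect how the proposed parser would segment a string.
   SELECT TOKENIZE('関西国際空港', '"parser"="kuromoji","parser_mode"="search"');

   -- Rows whose content contains all of the query's tokens.
   SELECT id FROM jp_articles WHERE content MATCH_ALL '国際 空港';

   -- Rows whose content contains the query's tokens as a contiguous phrase.
   SELECT id FROM jp_articles WHERE content MATCH_PHRASE '国際空港';
   ```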
   
   ### Motivation
     - Enables accurate full-text search over Japanese columns, on par with 
OpenSearch/Lucene kuromoji.
     - Fills the obvious gap next to the existing Chinese (IK) analyzer.
     - Implemented natively in C++ with no JVM on the indexing hot path, and 
Apache-license-clean: the engine is Apache-2.0, and the IPADIC dictionary is 
licensed under NAIST-2003 and is the same permissive lexicon that Apache Lucene 
already bundles.
   
   ### Solution
   
   Add a native C++ port of the Lucene kuromoji analyzer, following the proven 
IK pattern (native C++ analyzer + tokenizer, dictionary as runtime data files).
   
   - An offline converter that compiles raw IPADIC into a C++-native runtime 
format, rather than reimplementing Lucene's FST byte format.
   - `parser_mode` support: `search` (the default, with SEARCH-mode 
decompounding), `normal`, and `extended`.
   
   ### Are you willing to submit PR?
   
   - [x] Yes I am willing to submit a PR!
   
   ### Code of Conduct
   
   - [x] I agree to follow this project's [Code of 
Conduct](https://www.apache.org/foundation/policies/conduct)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

