nishant94 opened a new issue, #64646: URL: https://github.com/apache/doris/issues/64646
### Search before asking

- [x] I had searched in the [issues](https://github.com/apache/doris/issues?q=is%3Aissue) and found no similar issues.

### Description

The Doris inverted index currently has no Japanese-aware tokenizer. Japanese text has no whitespace word boundaries, so the existing english/unicode/standard parsers either index whole runs of text or split on individual characters. Both give poor MATCH / MATCH_PHRASE recall and precision for Japanese content.

OpenSearch and Elasticsearch solve this with the Lucene kuromoji morphological analyzer. Doris already ships a comparable analyzer for Chinese, the IK analyzer (`be/src/storage/index/inverted/analyzer/ik/`), but there is no equivalent for Japanese.

This enhancement proposes a built-in kuromoji parser, selectable per inverted-indexed column via DDL, that segments Japanese text into morphemes at index and query time:

```
INDEX content_idx (`content`) USING INVERTED PROPERTIES("parser" = "kuromoji", "parser_mode" = "search")
```

Once indexed, MATCH, MATCH_PHRASE, and TOKENIZE() operate over the segmented Japanese terms.

### Motivation

- Enables accurate full-text search over Japanese columns, on par with OpenSearch/Lucene kuromoji.
- Fills the obvious gap next to the existing Chinese (IK) analyzer.
- Implemented natively in C++ with no JVM on the indexing hot path, and Apache-license-clean (the engine is Apache-2.0; the IPADIC dictionary is NAIST-2003, the same permissive lexicon Apache Lucene already bundles).

### Solution

Add a native C++ port of the Lucene kuromoji analyzer, following the proven IK pattern (native C++ analyzer + tokenizer, dictionary as runtime data files).

- An offline converter that compiles raw IPADIC into a C++-native runtime format, rather than reimplementing Lucene's FST byte format.
- `parser_mode` support: search (default, with SEARCH-mode decompounding), normal, and extended.

### Are you willing to submit PR?

- [x] Yes I am willing to submit a PR!

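To make the DDL in the Description concrete, here is a minimal end-to-end sketch of how the proposed parser might be exercised. It is an illustration under assumptions, not working SQL today: the `kuromoji` parser does not exist in Doris yet, and the table name `ja_docs`, its columns, and the sample text are hypothetical. The `TOKENIZE`, `MATCH`, and `MATCH_PHRASE` forms follow the way the existing parsers are used.

```sql
-- Sketch only: "kuromoji" is the parser proposed in this issue, not an existing one.
-- Table name, columns, and sample data are hypothetical.
CREATE TABLE ja_docs (
    id BIGINT,
    content STRING,
    INDEX content_idx (`content`) USING INVERTED
        PROPERTIES("parser" = "kuromoji", "parser_mode" = "search")
)
DUPLICATE KEY(id)
DISTRIBUTED BY HASH(id) BUCKETS 1
PROPERTIES ("replication_num" = "1");

INSERT INTO ja_docs VALUES (1, '関西国際空港に到着しました');

-- Inspect how the proposed parser would segment a string.
SELECT TOKENIZE('関西国際空港に到着しました',
                '"parser"="kuromoji","parser_mode"="search"');

-- Term and phrase queries over the segmented morphemes.
SELECT id FROM ja_docs WHERE content MATCH '空港';
SELECT id FROM ja_docs WHERE content MATCH_PHRASE '国際空港';
```

With search-mode decompounding, a compound such as 関西国際空港 would be expected to also yield its component morphemes, which is what lets the `MATCH '空港'` query find the row. The exact token output depends on the final implementation and dictionary.
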
### Code of Conduct

- [x] I agree to follow this project's [Code of Conduct](https://www.apache.org/foundation/policies/conduct)
