nishant94 opened a new pull request, #64667:
URL: https://github.com/apache/doris/pull/64667
### What problem does this PR solve?
Issue Number: https://github.com/apache/doris/issues/64646
Related PR: None
Problem Summary:
Doris has no Japanese-aware tokenizer for the inverted index. Japanese text
has no spaces between words, so the existing parsers can't segment it and
`MATCH` / `MATCH_PHRASE` on Japanese columns end up with poor recall and
precision.
This PR adds a built-in `kuromoji` parser for Japanese, in the same style as
the existing Chinese IK analyzer. It's opt-in per column:
```sql
INDEX content_idx (`content`) USING INVERTED
PROPERTIES("parser" = "kuromoji", "parser_mode" = "search");
```
After indexing, `MATCH`, `MATCH_PHRASE` and `TOKENIZE()` run against the
segmented Japanese terms.
How it works:
- Native C++ under `be/src/storage/index/inverted/analyzer/kuromoji/`, so
there's no JVM on the indexing path. `KuromojiAnalyzer` / `KuromojiTokenizer`
mirror the IK analyzer/tokenizer, with a Viterbi cost-model segmenter over the
IPADIC connection-cost matrix.
- The dictionary is a process-wide singleton loaded once from
`${inverted_index_dict_path}/kuromoji`. An offline converter compiles raw IPADIC
into a compact C++ runtime format (double-array trie + cost matrix +
char/unknown tables) at build time, so no binary blob is committed.
- `search` (default), `normal` and `extended` modes are supported. No
thrift/proto changes are needed: parser and mode ride as strings in the index
properties.
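
For reviewers unfamiliar with cost-model segmentation, the Viterbi step mentioned above can be sketched roughly as follows. This is a toy illustration, not the code in this PR: `Entry`, `Node`, and `segment` are hypothetical names, the lexicon is a linear list rather than a double-array trie, unknown-word handling is omitted, and the costs in the usage note are made up rather than taken from IPADIC.

```cpp
#include <algorithm>
#include <cassert>
#include <climits>
#include <string>
#include <vector>

// Hypothetical toy lexicon entry: surface form, context class, word cost.
struct Entry { std::string surface; int cls; int cost; };

// One lattice node: a dictionary match covering bytes [begin, end) of the input.
struct Node { size_t begin; size_t end; int cls; long best; int prev; };

// Returns the minimum-cost segmentation of `text`, or an empty vector if the
// lexicon cannot cover it. conn[l][r] is the cost of class r directly
// following class l; class 0 is reserved for the BOS/EOS markers.
// Offsets are bytes, which is safe for UTF-8 as long as every surface form
// is a non-empty, complete UTF-8 string.
std::vector<std::string> segment(const std::string& text,
                                 const std::vector<Entry>& lex,
                                 const std::vector<std::vector<int>>& conn) {
    assert(!conn.empty());
    const size_t n = text.size();
    std::vector<Node> nodes;
    nodes.push_back({0, 0, 0, 0, -1});              // node 0 is BOS
    std::vector<std::vector<int>> ending_at(n + 1); // node ids by end offset
    ending_at[0].push_back(0);

    for (size_t pos = 0; pos < n; ++pos) {
        if (ending_at[pos].empty()) continue;       // no path reaches this offset
        for (const Entry& e : lex) {
            if (text.compare(pos, e.surface.size(), e.surface) != 0) continue;
            Node node{pos, pos + e.surface.size(), e.cls, LONG_MAX, -1};
            // Pick the cheapest predecessor among nodes ending at `pos`.
            for (int p : ending_at[pos]) {
                long c = nodes[p].best + conn[nodes[p].cls][e.cls] + e.cost;
                if (c < node.best) { node.best = c; node.prev = p; }
            }
            ending_at[node.end].push_back(static_cast<int>(nodes.size()));
            nodes.push_back(node);
        }
    }

    // Connect to EOS and choose the cheapest complete path.
    long best = LONG_MAX;
    int last = -1;
    for (int p : ending_at[n]) {
        long c = nodes[p].best + conn[nodes[p].cls][0];
        if (c < best) { best = c; last = p; }
    }

    // Walk the back-pointers from the last word to BOS, then reverse.
    std::vector<std::string> out;
    for (int i = last; i > 0; i = nodes[i].prev) {
        out.push_back(text.substr(nodes[i].begin, nodes[i].end - nodes[i].begin));
    }
    std::reverse(out.begin(), out.end());
    return out;
}
```

With an all-zero connection matrix and made-up word costs of 5000 for 東京都, 3000 for 東京, and 4000 for 都, the sketch keeps 東京都 as a single token; lowering the connection cost between the classes of 東京 and 都 by 3000 makes the two-token path cheaper. That interplay between word costs and connection costs is what the segmenter optimizes.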
Dictionary source is mecab-ipadic-2.7.0-20070801 (NAIST-2003 license, the
same lexicon Lucene kuromoji uses).
### Release note
Support Japanese text tokenization in the inverted index via a new `kuromoji`
parser (`PROPERTIES("parser"="kuromoji")`), with `search`, `normal`, and
`extended` modes.
### Check List (For Author)
- Test
- [x] Regression test
- [x] Unit Test
- [x] Manual test (add detailed scripts or steps below)
```sql
CREATE TABLE test_jp (
id BIGINT,
content TEXT,
INDEX idx_content (content) USING INVERTED
PROPERTIES("parser" = "kuromoji", "parser_mode" = "search")
) ENGINE=OLAP
DUPLICATE KEY(id)
DISTRIBUTED BY HASH(id) BUCKETS 1
PROPERTIES("replication_num" = "1");
INSERT INTO test_jp VALUES
(1, '東京都に住んでいます'),
(2, '日本語の形態素解析エンジン');
-- search-mode decompounding: 東京都 also matches 東京
SELECT id FROM test_jp WHERE content MATCH '東京'; -- expect: 1
SELECT id FROM test_jp WHERE content MATCH_PHRASE '形態素解析'; -- expect: 2
-- inspect segmentation directly
SELECT TOKENIZE('東京都に住んでいます',
'"parser"="kuromoji","parser_mode"="search"');
```
- Behavior changed:
- [ ] No.
- [x] Yes. It adds a new opt-in kuromoji parser. Existing parsers and
their output are unchanged; the new behavior only applies to indexes that
explicitly set `parser="kuromoji"`.
- Does this need documentation?
- [ ] No.
- [x] Yes. [PR Link](https://github.com/apache/doris-website/pull/3946)
to Doris-Website.
### Check List (For Reviewer who merge this PR)
- [ ] Confirm the release note
- [ ] Confirm test cases
- [ ] Confirm document
- [ ] Add branch pick label