nishant94 opened a new pull request, #64667:
URL: https://github.com/apache/doris/pull/64667

    ### What problem does this PR solve?
   
   Issue Number: https://github.com/apache/doris/issues/64646
   
   Related PR: None
   
   Problem Summary:
   Doris has no Japanese-aware tokenizer for the inverted index. Japanese text 
has no spaces between words, so the existing parsers can't segment it and 
`MATCH` / `MATCH_PHRASE` on Japanese columns end up with poor recall and 
precision.
   
   This PR adds a built-in `kuromoji` parser for Japanese, in the same style as 
the existing Chinese IK analyzer. It's opt-in per column:
   ```sql
     INDEX content_idx (`content`) USING INVERTED
     PROPERTIES("parser" = "kuromoji", "parser_mode" = "search");
   ```
   After indexing, `MATCH`, `MATCH_PHRASE` and `TOKENIZE()` run against the 
segmented Japanese terms.
   
   How it works:
   - Native C++ under `be/src/storage/index/inverted/analyzer/kuromoji/`, so 
there's no JVM on the indexing path. `KuromojiAnalyzer` / `KuromojiTokenizer` 
mirror the IK analyzer/tokenizer, with a Viterbi cost-model segmenter over the 
IPADIC connection-cost matrix.
   - The dictionary is a process-wide singleton loaded once from 
`${inverted_index_dict_path}/kuromoji`. An offline converter compiles raw IPADIC 
into a compact C++ runtime format (double-array trie + cost matrix + 
char/unknown tables) at build time, so no binary blob is committed.
   - `search` (default), `normal` and `extended` modes are supported. No 
thrift/proto changes are needed: parser and mode are passed as strings in the 
index properties.
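
As a rough illustration of the Viterbi cost-model idea mentioned above (not the code in this PR), the sketch below picks the cheapest segmentation of a string by summing per-word costs and connection costs between adjacent words. The lexicon entries, context IDs, and cost values are all invented for the example; real IPADIC values differ, and unknown-word handling and the `search`/`extended` mode logic are left out.

```python
from math import inf

# Toy lexicon: surface -> list of (left_id, right_id, word_cost).
# IDs and costs are made up for illustration, not taken from IPADIC.
LEXICON = {
    "東京": [(1, 1, 3000)],
    "京都": [(1, 1, 3200)],
    "東": [(2, 2, 6000)],
    "都": [(3, 3, 4500)],
    "東京都": [(1, 1, 5500)],
    "に": [(4, 4, 1000)],
}

# Toy connection costs: (previous right_id, next left_id) -> cost.
CONN = {(1, 3): -500, (1, 4): -800, (3, 4): -300}
DEFAULT_CONN = 500   # used for any pair not listed in CONN
BOS_EOS_ID = 0       # context ID for the sentence boundary markers


def conn_cost(prev_right_id, next_left_id):
    return CONN.get((prev_right_id, next_left_id), DEFAULT_CONN)


def segment(text):
    """Return the minimum-cost segmentation of text using the toy tables."""
    n = len(text)
    # ends[i] holds lattice nodes ending at offset i:
    # (best total cost, right_id, start offset, index of best predecessor)
    ends = [[] for _ in range(n + 1)]
    ends[0].append((0, BOS_EOS_ID, 0, None))  # beginning-of-sentence node

    for start in range(n):
        if not ends[start]:
            continue  # no path reaches this offset
        for end in range(start + 1, n + 1):
            for left_id, right_id, word_cost in LEXICON.get(text[start:end], []):
                best_cost, best_prev = inf, None
                for idx, (prev_cost, prev_right, _, _) in enumerate(ends[start]):
                    cost = prev_cost + conn_cost(prev_right, left_id) + word_cost
                    if cost < best_cost:
                        best_cost, best_prev = cost, idx
                ends[end].append((best_cost, right_id, start, best_prev))

    if not ends[n]:
        raise ValueError("no segmentation found with the toy lexicon")

    # Pick the final node that is cheapest once connected to end-of-sentence.
    idx = min(
        range(len(ends[n])),
        key=lambda i: ends[n][i][0] + conn_cost(ends[n][i][1], BOS_EOS_ID),
    )

    # Follow back pointers to recover the token sequence.
    tokens, pos = [], n
    while pos > 0:
        _, _, start, prev = ends[pos][idx]
        tokens.append(text[start:pos])
        pos, idx = start, prev
    return tokens[::-1]


print(segment("東京都に"))  # ['東京都', 'に'] with these toy costs
```

With these invented costs the single entry 東京都 wins over 東京 + 都; a decompounding mode would additionally have to expose the shorter parts, which this sketch does not attempt.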
   
   Dictionary source is mecab-ipadic-2.7.0-20070801 (NAIST-2003 license, the 
same lexicon Lucene kuromoji uses).
   
   ### Release note
   Support Japanese text tokenization in the inverted index via a new kuromoji 
parser (`PROPERTIES("parser"="kuromoji")`), with `search/normal/extended` modes.
   
   ### Check List (For Author)
   
     - Test
       - [x] Regression test
       - [x] Unit Test
       - [x] Manual test (add detailed scripts or steps below)
   
   ```sql
     CREATE TABLE test_jp (
       id BIGINT,
       content TEXT,
       INDEX idx_content (content) USING INVERTED
         PROPERTIES("parser" = "kuromoji", "parser_mode" = "search")
     ) ENGINE=OLAP
     DUPLICATE KEY(id)
     DISTRIBUTED BY HASH(id) BUCKETS 1
     PROPERTIES("replication_num" = "1");
   
     INSERT INTO test_jp VALUES
       (1, '東京都に住んでいます'),
       (2, '日本語の形態素解析エンジン');
   
     -- search-mode decompounding: a query for 東京 also matches the indexed 東京都
     SELECT id FROM test_jp WHERE content MATCH '東京';          -- expect: 1
     SELECT id FROM test_jp WHERE content MATCH_PHRASE '形態素解析'; -- expect: 2
   
     -- inspect segmentation directly
     SELECT TOKENIZE('東京都に住んでいます', '"parser"="kuromoji","parser_mode"="search"');
   ```
   
     - Behavior changed:
       - [ ] No.
       - [x] Yes. It adds a new opt-in kuromoji parser. Existing parsers and 
their output are unchanged; the new behavior only applies to indexes that 
explicitly set `parser="kuromoji"`.
     - Does this need documentation?
       - [ ] No.
       - [x] Yes. [PR Link](https://github.com/apache/doris-website/pull/3946) 
to Doris-Website.
   
   ### Check List (For Reviewer who merges this PR)
   
     - [ ] Confirm the release note
     - [ ] Confirm test cases
     - [ ] Confirm document
     - [ ] Add branch pick label
   
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

