Ryan19929 opened a new pull request, #269:
URL: https://github.com/apache/doris-thirdparty/pull/269

   Support the IK tokenizer for the inverted index:
   Migrate analysis-ik from Java to C++ and implement basic tokenization 
functionality.
   The major differences from the original Java code are as follows:
   1. **Encoding Format Difference**: IK-C++ uses /jieba/Unicode.hpp to 
process characters. 
   2. **Memory Management Optimization**: Add a custom allocator to avoid 
performance overhead caused by frequent memory allocation in STL containers.
   3. **Remote Dictionary Support**: IK-C++ does not currently support remote 
dictionaries.
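
   As a rough illustration of points 1 and 2 together, the sketch below decodes UTF-8 bytes into Unicode code points and stores them in a container backed by a pre-allocated buffer. It does not reproduce /jieba/Unicode.hpp or the custom allocator in this PR: `decode_utf8` and `count_code_points` are hypothetical names, and the standard C++17 `std::pmr` facilities stand in for the allocator, assuming a C++17 toolchain.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <memory_resource>
#include <string>
#include <vector>

// Hypothetical helper: decodes well-formed UTF-8 into code points stored in
// a vector that draws its memory from `pool`. Malformed input is not
// validated in this sketch.
std::pmr::vector<char32_t> decode_utf8(const std::string& bytes,
                                       std::pmr::memory_resource* pool) {
    assert(pool != nullptr);
    std::pmr::vector<char32_t> out(pool);
    out.reserve(bytes.size());  // upper bound: one code point per byte
    std::size_t i = 0;
    while (i < bytes.size()) {
        const unsigned char lead = static_cast<unsigned char>(bytes[i]);
        // Sequence length from the lead byte: 0xxxxxxx, 110xxxxx, 1110xxxx, 11110xxx.
        const std::size_t len = lead < 0x80 ? 1 : lead < 0xE0 ? 2 : lead < 0xF0 ? 3 : 4;
        if (i + len > bytes.size()) break;  // truncated tail: stop quietly
        // Keep only the payload bits of the lead byte.
        char32_t cp = len == 1 ? lead : (lead & (0xFF >> (len + 1)));
        for (std::size_t k = 1; k < len; ++k) {
            // Each continuation byte contributes its low six bits.
            cp = (cp << 6) | (static_cast<unsigned char>(bytes[i + k]) & 0x3F);
        }
        out.push_back(cp);
        i += len;
    }
    return out;
}

// Decodes one string using a stack buffer as the arena and returns the
// number of code points. If the buffer is exhausted, the resource falls
// back to the default (heap) resource.
std::size_t count_code_points(const std::string& bytes) {
    std::array<std::byte, 1024> buffer;
    std::pmr::monotonic_buffer_resource pool(buffer.data(), buffer.size());
    return decode_utf8(bytes, &pool).size();
}
```

   A monotonic resource turns each allocation into a pointer bump and releases everything at once when the resource is destroyed, which is one common way to cut the cost of many small container allocations.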
   
   Major changes to the original code:
   1. **testChinese.cpp**: Add a speed test for Chinese tokenization. It uses 
the dataset at 
`/src/test/data/contribs-lib/analysis/chinese/speed-test-text.txt` (红楼梦, 
Dream of the Red Chamber).
   2. **LanguageBasedAnalyzer.h/cpp**: 
   Add the IK tokenizer configuration, initialization entry point, and 
dictionary-loading logic. 
   Add a temporary entry for the IK tokenization mode to `AnalyzerMode`.
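
   For readers unfamiliar with this kind of speed test, a minimal timing harness might look like the sketch below. It is not the code in testChinese.cpp: `measure_speed`, `SpeedResult`, and `stand_in_tokenize` are hypothetical names, and the stand-in merely counts code points where the real test would run the IK tokenizer over the dataset.

```cpp
#include <chrono>
#include <cstddef>
#include <string>

// Hypothetical stand-in for a tokenizer: counts Unicode code points by
// counting every byte that is not a UTF-8 continuation byte (10xxxxxx).
std::size_t stand_in_tokenize(const std::string& utf8) {
    std::size_t n = 0;
    for (unsigned char b : utf8) {
        if ((b & 0xC0) != 0x80) ++n;
    }
    return n;
}

struct SpeedResult {
    std::size_t tokens;       // tokens produced in the last round
    double bytes_per_second;  // input bytes processed per second of wall time
};

// Runs `tokenize` over `text` for `rounds` iterations and reports throughput.
template <typename Tokenize>
SpeedResult measure_speed(const std::string& text, int rounds, Tokenize tokenize) {
    using clock = std::chrono::steady_clock;  // monotonic, suited to intervals
    std::size_t tokens = 0;
    const auto start = clock::now();
    for (int r = 0; r < rounds; ++r) {
        tokens = tokenize(text);
    }
    const std::chrono::duration<double> elapsed = clock::now() - start;
    const double total_bytes = static_cast<double>(text.size()) * rounds;
    // Guard against a zero reading on very fast runs with a coarse clock.
    const double seconds = elapsed.count() > 0.0 ? elapsed.count() : 1e-9;
    return SpeedResult{tokens, total_bytes / seconds};
}
```

   Calling `measure_speed(text, 100, stand_in_tokenize)` with the file contents loaded into `text` would report bytes per second; repeating the run over several rounds smooths out timer noise.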


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

