Ryan19929 opened a new pull request, #269: URL: https://github.com/apache/doris-thirdparty/pull/269
Support IK tokenizer for inverted index: migrate analysis-ik from Java to C++ and implement basic tokenization functionality.

The major differences from the original Java code are as follows:

1. **Encoding Format Difference**: IK-C++ uses `/jieba/Unicode.hpp` to process characters.
2. **Memory Management Optimization**: A custom allocator is added to avoid the performance overhead caused by frequent memory allocation in STL containers.
3. **Remote Dictionary Support**: IK-C++ does not currently support remote dictionaries.

Major changes to the original code:

1. **testChinese.cpp**: Add a test that measures Chinese tokenization speed, using the dataset located at `/src/test/data/contribs-lib/analysis/chinese/speed-test-text.txt` (红楼梦, *Dream of the Red Chamber*).
2. **LanguageBasedAnalyzer.h/cpp**: Add IK tokenizer configuration, the initialization entry, and dictionary loading logic. Add the IK tokenization mode entry (a temporary mode entry) in `AnalyzerMode`.
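
For readers unfamiliar with the idea behind the **Memory Management Optimization** item, here is a minimal sketch of an arena-backed allocator plugged into an STL container. It is not the allocator from this PR: `Arena`, `ArenaAllocator`, and `sum_offsets` are hypothetical names chosen for illustration, and the sketch assumes that alignments do not exceed the default alignment of `operator new[]`.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <vector>

// Hypothetical bump-pointer arena: hands out slices of large blocks and
// frees everything at once when the arena is destroyed.
class Arena {
public:
    explicit Arena(std::size_t block_size = 4096) : block_size_(block_size) {}
    Arena(const Arena&) = delete;
    Arena& operator=(const Arena&) = delete;

    void* allocate(std::size_t bytes, std::size_t align) {
        // Round the current offset up to the requested (power-of-two) alignment.
        std::size_t offset = (used_ + align - 1) & ~(align - 1);
        if (blocks_.empty() || offset + bytes > capacity_) {
            // Start a new block, large enough for oversized requests.
            capacity_ = bytes > block_size_ ? bytes : block_size_;
            blocks_.push_back(std::make_unique<unsigned char[]>(capacity_));
            offset = 0;
        }
        used_ = offset + bytes;
        return blocks_.back().get() + offset;
    }

    std::size_t block_count() const { return blocks_.size(); }

private:
    std::size_t block_size_;
    std::size_t capacity_ = 0;  // size of the current block
    std::size_t used_ = 0;      // bytes consumed in the current block
    std::vector<std::unique_ptr<unsigned char[]>> blocks_;
};

// Hypothetical STL-compatible allocator that forwards to an Arena.
template <typename T>
struct ArenaAllocator {
    using value_type = T;
    Arena* arena;

    explicit ArenaAllocator(Arena& a) noexcept : arena(&a) {}
    template <typename U>
    ArenaAllocator(const ArenaAllocator<U>& other) noexcept : arena(other.arena) {}

    T* allocate(std::size_t n) {
        return static_cast<T*>(arena->allocate(n * sizeof(T), alignof(T)));
    }
    // Individual frees are no-ops; memory is reclaimed with the arena.
    void deallocate(T*, std::size_t) noexcept {}
};

template <typename T, typename U>
bool operator==(const ArenaAllocator<T>& a, const ArenaAllocator<U>& b) noexcept {
    return a.arena == b.arena;
}
template <typename T, typename U>
bool operator!=(const ArenaAllocator<T>& a, const ArenaAllocator<U>& b) noexcept {
    return !(a == b);
}

// Stores n offsets in an arena-backed vector and returns their sum.
long sum_offsets(Arena& arena, std::size_t n) {
    ArenaAllocator<int> alloc(arena);
    std::vector<int, ArenaAllocator<int>> offsets(alloc);
    offsets.reserve(n);
    for (std::size_t i = 0; i < n; ++i) offsets.push_back(static_cast<int>(i));
    long total = 0;
    for (int v : offsets) total += v;
    return total;
}
```

The trade-off in this kind of design is that memory released by a container is not reused until the whole arena is destroyed, which tends to suit short-lived, per-document tokenization state better than long-lived structures.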
