chenboat opened a new issue, #16294: URL: https://github.com/apache/pinot/issues/16294
An n-gram (https://en.wikipedia.org/wiki/N-gram) index can be applied to string columns to speed up queries with a LIKE filtering condition (e.g., LIKE('%pino%')). The basic idea is to extract consecutive character sequences (e.g., _pin_, _ino_, _not_ for n = 3 on the string _pinot_) from the original strings and build an inverted index on these sequences.

When processing a LIKE filter for a substring match, one can break the search string into grams in the same way as during indexing and then look up the documents that contain ALL of the grams. Documents missing any gram cannot match and are pruned; the remaining documents (often far fewer) are then string-matched to confirm an exact match. In some cases, where the search string is shorter than or equal to the n-gram length, the final validation can be omitted. An n-gram index is similar to a Bloom filter: both provide effective pruning of non-matching documents.

N-gram indexes are available in open-source systems such as [Elasticsearch](https://www.elastic.co/docs/reference/text-analysis/analysis-ngram-tokenizer) and [StarRocks](https://docs.starrocks.io/docs/table_design/indexes/Ngram_Bloom_Filter_Index/). Compared with a text index, an n-gram index is generally larger because it extracts sub-word sequences. On the other hand, it can process wildcard queries (e.g., %pino%) more efficiently. Note that libraries like Lucene usually discourage a leading * in their text queries ([ref](https://lucene.apache.org/core/2_9_4/queryparsersyntax.html#Wildcard%20Searches)).
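To make the lookup-then-validate flow concrete, here is a minimal Python sketch. It is an illustration only, not Pinot's implementation: the function names (`ngrams`, `build_index`, `like_contains`) and the in-memory dictionary index are hypothetical, and the fallback to a full scan when the search string is shorter than n is an assumption of this sketch.

```python
from collections import defaultdict


def ngrams(text, n=3):
    """Return the set of length-n character sequences found in text."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def build_index(docs, n=3):
    """Build an inverted index: gram -> set of ids of documents containing it."""
    index = defaultdict(set)
    for doc_id, value in enumerate(docs):
        for gram in ngrams(value, n):
            index[gram].add(doc_id)
    return index


def like_contains(docs, index, needle, n=3):
    """Evaluate LIKE '%needle%' using the gram index as a pre-filter."""
    grams = ngrams(needle, n)
    if not grams:
        # Assumption of this sketch: a needle shorter than n yields no grams,
        # so every document stays a candidate and is checked directly.
        candidates = set(range(len(docs)))
    else:
        # Keep only documents that contain ALL grams of the needle.
        candidates = set.intersection(*(index.get(g, set()) for g in grams))
    # Final validation: the grams may occur in a document without being adjacent.
    return sorted(doc_id for doc_id in candidates if needle in docs[doc_id])


docs = ["apache pinot", "pinto beans", "opinion", "casino pin"]
index = build_index(docs)
matches = like_contains(docs, index, "pino")  # [0]
```

In this example "casino pin" contains both _pin_ and _ino_ and so survives the index lookup, but the final string match rejects it because the two grams are not adjacent. That is why the validation step is still needed whenever the search string is longer than n.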
