chenboat opened a new issue, #16294:
URL: https://github.com/apache/pinot/issues/16294

   An n-gram (https://en.wikipedia.org/wiki/N-gram) index can be applied to string 
columns to speed up queries with a LIKE filter condition (e.g., 
LIKE('%pino%')). The basic idea is to extract consecutive character sequences 
(e.g., _pin_, _ino_, _not_ for n = 3 on the string _pinot_) from the original 
strings and build an inverted index on these sequences. When processing a LIKE 
filter for a substring match, one can break the search string into grams in the 
same way as during indexing and then look up the documents that contain ALL 
of the grams. These candidate documents (often far fewer than the full set) are 
then string matched to confirm an exact match, because a document can contain 
every gram without containing the substring itself. In some cases where the 
search string is no longer than the n-gram length, the final validation can be 
omitted. An n-gram index is similar to a bloom filter in that both provide 
effective pruning of non-matching documents.
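
   The indexing and lookup steps described above can be sketched roughly as 
follows. This is a toy in-memory illustration, not Pinot code: the names 
`ngrams`, `NgramIndex`, `candidates`, and `like_contains` are hypothetical, 
and the fallback to a full scan for search strings shorter than n is an 
assumption made only for this sketch.

```python
from collections import defaultdict


def ngrams(text, n):
    """Return the set of all length-n character sequences in text."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}


class NgramIndex:
    """Toy inverted index from n-gram to document ids (hypothetical, not Pinot's API)."""

    def __init__(self, docs, n=3):
        self.docs = docs
        self.n = n
        self.postings = defaultdict(set)
        for doc_id, value in enumerate(docs):
            for gram in ngrams(value, n):
                self.postings[gram].add(doc_id)

    def candidates(self, needle):
        """Ids of documents that contain ALL grams of the search string."""
        grams = ngrams(needle, self.n)
        if not grams:
            # Assumption for this sketch: a search string shorter than n
            # yields no grams, so the index cannot prune and we scan everything.
            return set(range(len(self.docs)))
        posting_lists = [self.postings.get(gram, set()) for gram in grams]
        return set.intersection(*posting_lists)

    def like_contains(self, needle):
        """Evaluate LIKE '%needle%': prune with the index, then verify each candidate."""
        return sorted(d for d in self.candidates(needle) if needle in self.docs[d])


docs = ["pinot", "pin casino", "pinto", "apache pinot"]
index = NgramIndex(docs, n=3)
pruned = index.candidates("pino")      # documents containing both "pin" and "ino"
matches = index.like_contains("pino")  # documents that really contain "pino"
```

   Here "pin casino" survives pruning because it contains both _pin_ and 
_ino_, but it is dropped by the final string match, which is why the 
validation step is needed when the search string is longer than n.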
   
   N-gram indexes are available in open source systems such as 
[Elasticsearch](https://www.elastic.co/docs/reference/text-analysis/analysis-ngram-tokenizer)
  and 
[StarRocks](https://docs.starrocks.io/docs/table_design/indexes/Ngram_Bloom_Filter_Index/).
   
   Compared with a text index, an n-gram index is generally larger because 
it extracts sub-word sequences. On the other hand, it can process wildcard 
queries (e.g., %pino%) more efficiently. Note that libraries like Lucene 
usually discourage a leading * in their text queries 
([ref](https://lucene.apache.org/core/2_9_4/queryparsersyntax.html#Wildcard%20Searches)).
 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

