[PR] Add first implementation of clpMatch that doesn't explicitly use indexes. [pinot]

via GitHub Fri, 19 Jan 2024 17:46:24 -0800


kirkrodrigues opened a new pull request, #12291:
URL: https://github.com/apache/pinot/pull/12291


   tags: feature, backward-incompat, release-notes
   
   This adds query rewriting logic to transform a "virtual" UDF, `clpMatch`, 
into a boolean expression on the columns of a CLP-encoded field.
   
   E.g., if the `message` field was encoded with CLP, users can write:
   
   ```sql
   SELECT clpDecode(message) FROM table WHERE clpMatch(message, 'Job started')
   ```
   
   This will get transformed into:
   
   ```sql
   SELECT clpDecode(message_logtype, message_dictionaryVars, message_encodedVars)
   FROM table WHERE message_logtype = 'Job started'
   ```
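
As a rough illustration of the column expansion shown above, the sketch below derives the three physical column names from a CLP-encoded field name and builds the expanded `clpDecode` call. The class and method names (`ClpColumnsSketch`, `physicalColumns`, `expandClpDecode`) are hypothetical and are not taken from the PR; only the column suffixes come from the example query.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of the PR: illustrates the column naming only.
final class ClpColumnsSketch {
    private ClpColumnsSketch() {}

    // Physical columns backing a CLP-encoded field, in clpDecode argument order.
    static List<String> physicalColumns(String field) {
        return Arrays.asList(
            field + "_logtype",
            field + "_dictionaryVars",
            field + "_encodedVars");
    }

    // Expands clpDecode(field) into its three-column form.
    static String expandClpDecode(String field) {
        return "clpDecode(" + String.join(", ", physicalColumns(field)) + ")";
    }
}
```

For the field `message`, `expandClpDecode` yields `clpDecode(message_logtype, message_dictionaryVars, message_encodedVars)`, matching the rewritten query above.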
   
   Naturally, the presence of wildcards in the query makes this rewriting more 
complicated. For instance, if the user writes:
   
   ```sql
   SELECT clpDecode(message) FROM table WHERE clpMatch(message, 'Started job*')
   ```
   This will get transformed into:
   
   ```sql
   SELECT clpDecode(message_logtype, message_dictionaryVars, message_encodedVars)
   FROM table
   WHERE (REGEXP_LIKE(message_logtype, '^Started job.*')
          OR (REGEXP_LIKE(message_logtype, '^Started \u0012.*')
              AND REGEXP_LIKE(message_dictionaryVars, '^job.*')))
     AND REGEXP_LIKE(clpDecode(message_logtype, message_dictionaryVars, message_encodedVars), '^Started job.*')
   ```
   
   This query translation is largely handled by the `clp-ffi` library so the 
query rewriter need only convert it into SQL.
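
One piece of the SQL conversion that is visible in the example above is turning a wildcard string such as `'Started job*'` into an anchored regex such as `'^Started job.*'`. A minimal sketch of that step follows, assuming `*` matches any run of characters, `?` matches a single character, and `\` escapes the next character. `WildcardToRegexSketch` is a hypothetical name; this is neither the `clp-ffi` API nor the rewriter's actual code, and the choice to append `$` only when the query does not end in `*` is an assumption made to reproduce the regex shown above.

```java
// Hypothetical sketch, not the PR's implementation.
final class WildcardToRegexSketch {
    private WildcardToRegexSketch() {}

    // Converts a wildcard query into a regex anchored at the start.
    static String toRegex(String wildcardQuery) {
        StringBuilder regex = new StringBuilder("^");
        boolean escaped = false;      // previous character was an unescaped backslash
        boolean endsWithStar = false; // last token emitted was a '*' wildcard
        for (int i = 0; i < wildcardQuery.length(); i++) {
            char c = wildcardQuery.charAt(i);
            endsWithStar = false;
            if (escaped) {
                appendLiteral(regex, c); // escaped character is always literal
                escaped = false;
            } else if (c == '\\') {
                escaped = true;
            } else if (c == '*') {
                regex.append(".*");
                endsWithStar = true;
            } else if (c == '?') {
                regex.append('.');
            } else {
                appendLiteral(regex, c);
            }
        }
        if (escaped) {
            regex.append("\\\\"); // trailing lone backslash: match it literally
        }
        if (!endsWithStar) {
            regex.append('$'); // assumption: anchor the end unless a trailing '*' was given
        }
        return regex.toString();
    }

    // Appends c, escaping it first if it is a regex metacharacter.
    private static void appendLiteral(StringBuilder sb, char c) {
        if ("\\.[]{}()*+?^$|".indexOf(c) >= 0) {
            sb.append('\\');
        }
        sb.append(c);
    }
}
```

With these assumptions, `toRegex("Started job*")` returns `^Started job.*`, and a query without wildcards such as `Job started` becomes `^Job started$`.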
   
   In order to perform wildcard matches on the `encodedVars` column of each 
CLP-encoded field, this PR also adds a new transform function 
`clpEncodedVarsMatch` that should only be called from the rewriter.
   
   The performance of the rewritten queries can be improved by adding indexes and 
making the rewriter index-aware. This will be added in a future PR.
   
   To use the rewriter, users need to change their broker config to add 
`org.apache.pinot.sql.parsers.rewriter.ClpRewriter` to 
`pinot.broker.query.rewriter.class.names`. Assuming the default set of query 
rewriters, that would look like:
   
   ```
   org.apache.pinot.sql.parsers.rewriter.CompileTimeFunctionsInvoker,org.apache.pinot.sql.parsers.rewriter.SelectionsRewriter,org.apache.pinot.sql.parsers.rewriter.PredicateComparisonRewriter,org.apache.pinot.sql.parsers.rewriter.ClpRewriter,org.apache.pinot.sql.parsers.rewriter.AliasApplier,org.apache.pinot.sql.parsers.rewriter.OrdinalsUpdater,org.apache.pinot.sql.parsers.rewriter.NonAggregationGroupByToDistinctQueryRewriter
   ```
   Note that we added it before the `AliasApplier` so that any aliasing of the 
CLP-encoded field happens only after the `clpDecode` rewrite.
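
For deployments that assemble this setting programmatically, the ordering constraint can be sketched as inserting `ClpRewriter` immediately before `AliasApplier` in the existing list. `RewriterListSketch` and its methods are hypothetical helpers, and the default list used here is simply the one quoted in this description with `ClpRewriter` removed; check it against the defaults of the Pinot version in use.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of Pinot.
final class RewriterListSketch {
    private RewriterListSketch() {}

    static final String PREFIX = "org.apache.pinot.sql.parsers.rewriter.";

    // Returns a comma-separated list with newRewriter placed right before `before`.
    static String insertBefore(List<String> rewriters, String newRewriter, String before) {
        List<String> result = new ArrayList<>(rewriters);
        int idx = result.indexOf(before);
        if (idx < 0) {
            throw new IllegalArgumentException("Rewriter not found: " + before);
        }
        result.add(idx, newRewriter);
        return String.join(",", result);
    }

    // Builds the value for pinot.broker.query.rewriter.class.names from the assumed defaults.
    static String exampleValue() {
        List<String> defaults = Arrays.asList(
            PREFIX + "CompileTimeFunctionsInvoker",
            PREFIX + "SelectionsRewriter",
            PREFIX + "PredicateComparisonRewriter",
            PREFIX + "AliasApplier",
            PREFIX + "OrdinalsUpdater",
            PREFIX + "NonAggregationGroupByToDistinctQueryRewriter");
        return insertBefore(defaults, PREFIX + "ClpRewriter", PREFIX + "AliasApplier");
    }
}
```

`exampleValue()` produces the same comma-separated string as the config value shown above.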
   
   NOTE:
   * In #11006 we added `CLPDecodeRewriter` to make it easier to call `clpDecode` 
for a CLP-encoded field. This PR renames that rewriter to `ClpRewriter` and 
modifies it to also perform the `clpMatch` rewriting described above. This is 
why this change is marked as backward-incompatible.
   
   
   This is part of the change requested in #9819 and described in this [design 
doc](https://docs.google.com/document/d/1nHZb37re4mUwEA258x3a2pgX13EWLWMJ0uLEDk1dUyU/edit#heading=h.7p47gbd7unf9).
   
   # Testing performed
   * Added new unit tests.
   * Validated that fields encoded with CLP can be queried correctly using 
`clpMatch`.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

