[PR] Add case-insensitive regex matching support for FST LUCENE indexes [pinot]

via GitHub Thu, 03 Jul 2025 13:28:18 -0700


raghavyadav01 opened a new pull request, #16276:
URL: https://github.com/apache/pinot/pull/16276


   ## Summary
   
   This PR adds support for case-insensitive regex matching in FST (Finite
State Transducer) LUCENE indexes while maintaining backward compatibility with
existing case-sensitive FST implementations.
   
   ## FST Behavior: Input/Output Types
   
   ### Existing FST Implementation
   - **Input Type**: BYTE4 (UTF-16 encoded strings)
   - **Output Type**: LONG (single dictionary ID)
   - **Mapping**: One key → One value (e.g., "Hello" → 1)
   
   ### Problem with Case-Insensitive Requirements
   For case-insensitive matching, multiple case variations must map to the
same normalized key, even though each variation has its own dictionary ID:
   - "Hello" → 1
   - "HELLO" → 2
   - "hello" → 3
   
   **Challenge**: The existing LONG output type can only store a single value 
per key, but we need to store multiple dictionary IDs for the same normalized 
key.
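
The collision can be illustrated with a minimal sketch. It is written in Python for brevity and uses a plain dict as a stand-in for the FST; the function name `build_single_valued` and the sample IDs are illustrative assumptions, not code from this PR.

```python
# Hypothetical dictionary: every distinct stored value has its own dictionary ID.
dictionary = {"Hello": 1, "HELLO": 2, "hello": 3}


def build_single_valued(entries):
    """Stand-in for a one-key-to-one-value FST keyed by the lowercased term."""
    index = {}
    for term, dict_id in entries.items():
        # All three variants normalize to "hello", so each write
        # overwrites the dictionary ID stored by the previous one.
        index[term.lower()] = dict_id
    return index


index = build_single_valued(dictionary)
print(index)  # {'hello': 3}

# Dictionary IDs that can no longer be reached through the index.
lost_ids = sorted(set(dictionary.values()) - set(index.values()))
print(lost_ids)  # [1, 2]
```

Only the last dictionary ID survives, which is why a single LONG output per key is not enough once keys are normalized.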
   
   ## Solution: Use BytesRef to Store Multiple Values
   
   ### New Case-Insensitive FST Implementation
   - **Input Type**: BYTE4 (UTF-16 encoded strings, normalized to lowercase)
   - **Output Type**: BytesRef (serialized list of dictionary IDs)
   - **Mapping**: One normalized key → Multiple values (e.g., "hello" → [1, 2, 
3])
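
A rough sketch of the one-to-many mapping follows, again in Python with a dict standing in for the FST. The byte layout (consecutive 4-byte big-endian integers) and the helper names `encode_ids`, `decode_ids`, and `build_case_insensitive` are assumptions made for illustration; the PR description does not specify the actual serialization format.

```python
import struct


def encode_ids(dict_ids):
    """Pack dictionary IDs as consecutive 4-byte big-endian ints (assumed layout)."""
    return struct.pack(f">{len(dict_ids)}i", *dict_ids)


def decode_ids(payload):
    """Inverse of encode_ids: recover the list of dictionary IDs."""
    count = len(payload) // 4
    return list(struct.unpack(f">{count}i", payload))


def build_case_insensitive(entries):
    """Group dictionary IDs under the lowercased key, then serialize each group."""
    grouped = {}
    for term, dict_id in entries.items():
        grouped.setdefault(term.lower(), []).append(dict_id)
    return {key: encode_ids(ids) for key, ids in grouped.items()}


index = build_case_insensitive({"Hello": 1, "HELLO": 2, "hello": 3})
print(decode_ids(index["hello"]))  # [1, 2, 3]
```

Storing the whole list as one byte payload keeps the index at one output per key while still preserving every dictionary ID.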
   
   ### Magic Header Detection
   To automatically distinguish between case-sensitive and case-insensitive
FSTs, the reader checks the first 4 bytes to determine the FST type:
   - If magic header = "FSTI" → Read as BytesRef (case-insensitive)
   - If magic header = "\fsa" → Read as Long (case-sensitive, backward 
compatibility)
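
The detection step could look roughly like the sketch below. It assumes that `\fsa` denotes a literal backslash followed by `fsa`, and that an unrecognized header is rejected with an error; neither detail is confirmed by the description above. The function name `detect_fst_type` is hypothetical.

```python
import io

CASE_INSENSITIVE_MAGIC = b"FSTI"
# Assumption: "\fsa" means a literal backslash followed by 'f', 's', 'a'.
CASE_SENSITIVE_MAGIC = b"\\fsa"


def detect_fst_type(stream):
    """Read the first 4 bytes of the stream and classify the FST."""
    magic = stream.read(4)
    if magic == CASE_INSENSITIVE_MAGIC:
        return "case-insensitive"  # outputs would be read as BytesRef
    if magic == CASE_SENSITIVE_MAGIC:
        return "case-sensitive"  # outputs would be read as Long
    # Assumption: unknown headers are treated as an error.
    raise ValueError(f"unrecognized FST magic header: {magic!r}")


print(detect_fst_type(io.BytesIO(b"FSTI" + b"\x00" * 8)))  # case-insensitive
print(detect_fst_type(io.BytesIO(b"\\fsa" + b"\x00" * 8)))  # case-sensitive
```

Because the decision depends only on the bytes already present in a segment, existing segments need no rewrite to be read correctly.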
   
   ## Backward Compatibility
   **Existing segments continue to work as-is with no changes required.**
   
   - All existing case-sensitive FST segments remain fully functional
   - No migration needed for current deployments
   - New case-insensitive FSTs are automatically detected via magic header
   
   ## Sample Configuration
   
   ```json
   {
     "tableName": "user_logs",
     "fieldConfigList": [
       {
         "name": "domain_name",
         "encodingType": "DICTIONARY",
         "indexes": {
           "fst": {
             "type": "LUCENE",
             "caseSensitive": false
           }
         }
       }
     ]
   }
   ```
   
   ## Sample Query
   
   ```sql
   SELECT domain_name, COUNT(*) 
   FROM user_logs 
   WHERE REGEXP_LIKE(domain_name, 'WWW.EXAMPLE.*') 
   GROUP BY domain_name
   ```
   
   **Response:**
   ```json
   {
     "resultTable": {
       "dataSchema": {
         "columnNames": ["domain_name", "count(*)"],
         "columnDataTypes": ["STRING", "LONG"]
       },
       "rows": [
         ["www.example.com", 100],
         ["WWW.EXAMPLE.ORG", 50], 
         ["www.Example.net", 75]
       ]
     }
   }
   ```
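
To show why the upper-case pattern in the sample query returns all three rows, here is a small sketch of the lookup. The dictionary IDs, the extra `api.example.com` entry, and the function names are hypothetical, and full-string matching is an assumption; a dict again stands in for the FST.

```python
import re

# Hypothetical dictionary IDs for the values in the sample response,
# plus one value that should not match.
dictionary = {
    "www.example.com": 0,
    "WWW.EXAMPLE.ORG": 1,
    "www.Example.net": 2,
    "api.example.com": 3,
}


def build_index(entries):
    """Map each lowercased key to the dictionary IDs of all its case variants."""
    index = {}
    for term, dict_id in entries.items():
        index.setdefault(term.lower(), []).append(dict_id)
    return index


def matching_dict_ids(index, pattern):
    """Return the dictionary IDs whose normalized key matches the pattern."""
    regex = re.compile(pattern, re.IGNORECASE)
    ids = []
    for key, dict_ids in index.items():
        if regex.fullmatch(key):  # assumption: the whole key must match
            ids.extend(dict_ids)
    return sorted(ids)


ids = matching_dict_ids(build_index(dictionary), "WWW.EXAMPLE.*")
print(ids)  # [0, 1, 2]
```

The returned dictionary IDs point back to the original values, so the response preserves the stored casing of each `domain_name`.
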
   
   ## Testing
   
   - Added case-insensitive FST tests 
(`FSTBasedCaseInsensitiveRegexpLikeQueriesTest.java`)
   - Enhanced existing case-sensitive tests to ensure backward compatibility
   - Added FST builder tests for both modes
   - Added configuration serialization/deserialization tests
   
   ## Breaking Changes
   
   None. This change is fully backward compatible.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

