[ 
https://issues.apache.org/jira/browse/HUDI-9566?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davis Zhang closed HUDI-9566.
-----------------------------
    Resolution: Fixed

The code is merged.

> Secondary index convert everything to string
> --------------------------------------------
>
>                 Key: HUDI-9566
>                 URL: https://issues.apache.org/jira/browse/HUDI-9566
>             Project: Apache Hudi
>          Issue Type: Bug
>            Reporter: Davis Zhang
>            Priority: Major
>              Labels: pull-request-available
>
> h1. Action items
> For SI, we only allow the following data types on both the read and write 
> paths:
>  * String
>  * Integer types, including Int, BigInt, Long, and Short
>  * Timestamp
>  * Char
>  * Boolean: due to its low cardinality, we should not support Boolean
>  
> This means:
>  * [Blocker AI] We only allow SI creation on columns with the above data 
> types.
>  * [Blocker AI] On the index lookup path, we allow index lookup only when the 
> lookup set has the same data type as the indexed column (integer types may 
> differ from each other; int vs. bigint is fine). Otherwise, we can do any of 
> the following:
>  ** [Most ideal] Fall back to no index lookup.
>  ** [Backup plan] Fail the query with an error.
>  ** [Nice to have, non-blocker] Best-effort type casting if viable. If the 
> cast fails, we fall back.
>  
> This blocker AI applies to all code paths that lead to index lookup, which 
> are captured in the "Read path" section.
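The two blocker AIs above amount to a type gate at index creation and another at lookup. A minimal sketch of that gate follows. The `ColType` enum and both method names are hypothetical and are not Hudi or Spark APIs; BigInt is assumed to map to `LONG`, as it does in Spark SQL.

```java
import java.util.EnumSet;
import java.util.Set;

class SecondaryIndexTypeGate {
    // Hypothetical type tags for illustration; not Hudi's or Spark's type classes.
    enum ColType { STRING, SHORT, INT, LONG, TIMESTAMP, CHAR, BOOLEAN, FLOAT, DOUBLE }

    // Types on which SI creation is allowed. BOOLEAN is left out (low cardinality).
    static final Set<ColType> SUPPORTED = EnumSet.of(
        ColType.STRING, ColType.SHORT, ColType.INT, ColType.LONG,
        ColType.TIMESTAMP, ColType.CHAR);

    // Integer types may be mixed between the indexed column and the lookup set.
    static final Set<ColType> INTEGER_FAMILY =
        EnumSet.of(ColType.SHORT, ColType.INT, ColType.LONG);

    // Write path: may an SI be created on a column of this type?
    static boolean canCreateIndex(ColType columnType) {
        return SUPPORTED.contains(columnType);
    }

    // Read path: may the index be used for this lookup set?
    // Returning false means "fall back to no index lookup".
    static boolean canUseIndex(ColType indexedType, ColType lookupType) {
        if (!SUPPORTED.contains(indexedType)) {
            return false;
        }
        if (indexedType == lookupType) {
            return true;
        }
        return INTEGER_FAMILY.contains(indexedType) && INTEGER_FAMILY.contains(lookupType);
    }

    public static void main(String[] args) {
        System.out.println(canCreateIndex(ColType.BOOLEAN));             // false
        System.out.println(canUseIndex(ColType.INT, ColType.LONG));      // true
        System.out.println(canUseIndex(ColType.STRING, ColType.INT));    // false
    }
}
```

A caller would treat a `false` from `canUseIndex` as the "most ideal" option in the list, that is, skipping the index and scanning as usual.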
>  
> [Blocker AI] To ensure that columns with SI get a consistent toString 
> behavior, we will ban schema evolution for those columns.
>  
> *[Warning]* We only fix the Spark code path. Other query engines need a 
> separate plan and owners. 
> h1. Description
> When generating a secondary index record from a data column, which could be 
> of any type, we convert the value to a string via "toString". RLI is 
> probably doing something similar.
> h2. Read path
> For SI lookup, the lookup key has to produce the matching string, even 
> though the lookup key provider may supply keys in various data types. This 
> can cause surprises.
>  
> For example, if an SI is built on a float column, the SI uses float's 
> toString and indexes the string literal "10.0". When we then do an index 
> lookup with a lookup set of long/int type, toString generates strings like 
> "10".
> 10.0 == 10 is true, but "10.0".equals("10") is false.
> This means that even if the values are numerically equal, the lookup can 
> still miss the value and thus cause correctness issues. 
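The mismatch described above can be reproduced with plain Java boxed types. This is only a sketch: `indexKey` mirrors the `secondaryKey.toString()` call in the quoted snippet, while the other helper names are made up for illustration.

```java
class ToStringMismatchDemo {
    // Mirrors the write path in the issue: the index key is whatever toString() returns.
    static String indexKey(Object value) {
        return value.toString();
    }

    // Numeric comparison, independent of the boxed type.
    static boolean numericallyEqual(Number a, Number b) {
        return a.doubleValue() == b.doubleValue();
    }

    // String comparison, which is what a string-keyed index effectively does.
    static boolean indexHit(Object indexedValue, Object lookupValue) {
        return indexKey(indexedValue).equals(indexKey(lookupValue));
    }

    public static void main(String[] args) {
        Object indexedValue = 10.0f; // value from a float column
        Object lookupValue = 10L;    // literal supplied by the query as a long

        System.out.println(indexKey(indexedValue));                 // 10.0
        System.out.println(indexKey(lookupValue));                  // 10
        System.out.println(numericallyEqual(10.0f, 10L));           // true
        System.out.println(indexHit(indexedValue, lookupValue));    // false
    }
}
```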
>  
> In the case of Spark SQL, the lookup set can be generated by queries such as:
>  
> select * from tbl where secKey in (xxx), where xxx can be static values or, 
> in the future, a subquery.
> select * from tbl where secKey = xxx
>  
> This is controlled purely by the lookup set provider, which is Spark, based 
> on how the query is written. So the data type is beyond our control unless 
> there is an explicit type cast.
>  
> h2. Write path
> As of today, an SI update always generates records from the data column, and 
> the data column has a consistent data type unless there is a schema 
> evolution. As long as no schema evolution happens, it should be fine. 
>  
> With schema evolution changing the data type (for example, int -> float), the 
> toString method starts to behave differently. If the SI previously tracked 
> int 10 as the string literal "10", after the change to float it will track it 
> as "10.0". Thus, even for a single record, we might have 2 SI records.
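A toy model of the key drift described above: the same record is indexed once as an int and once as a float. The in-memory map is an assumption made for illustration only. It does not model how Hudi stores, updates, or deletes secondary index entries.

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

class SchemaEvolutionIndexDemo {
    // Toy index: secondary key string -> record keys. Not Hudi's storage layout.
    final Map<String, Set<String>> entries = new TreeMap<>();

    // Index a record the same way the quoted snippet does: key = value.toString().
    void indexRecord(String recordKey, Object secondaryValue) {
        entries.computeIfAbsent(secondaryValue.toString(), k -> new TreeSet<>())
               .add(recordKey);
    }

    public static void main(String[] args) {
        SchemaEvolutionIndexDemo index = new SchemaEvolutionIndexDemo();

        index.indexRecord("r1", 10);     // column is int: key "10"
        index.indexRecord("r1", 10.0f);  // column evolved to float: key "10.0"

        // One logical record, two index entries under different string keys.
        System.out.println(index.entries); // {10=[r1], 10.0=[r1]}
    }
}
```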
>  
>  
> The same applies to timestamps: two values that represent the same UTC 
> instant, one rendered in PST and the other in ET, produce different strings 
> and cause a mismatch. The list keeps going as we evaluate all possible data 
> types.
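The timezone effect can be shown with `java.time`. This sketch uses `ZonedDateTime` purely as an illustration and makes no claim about which timestamp class Hudi or Spark passes to `toString`. The `epochKey` helper is a hypothetical timezone-independent alternative, not something the issue proposes.

```java
import java.time.Instant;
import java.time.ZoneId;

class TimestampToStringDemo {
    // Render an instant the way a zone-aware toString() would.
    static String render(Instant instant, String zoneId) {
        return instant.atZone(ZoneId.of(zoneId)).toString();
    }

    // Hypothetical zone-independent key: epoch milliseconds as a string.
    static String epochKey(Instant instant) {
        return Long.toString(instant.toEpochMilli());
    }

    public static void main(String[] args) {
        Instant instant = Instant.parse("2025-01-15T18:00:00Z");

        String pacific = render(instant, "America/Los_Angeles");
        String eastern = render(instant, "America/New_York");

        System.out.println(pacific);
        System.out.println(eastern);
        System.out.println(pacific.equals(eastern)); // false: same instant, different strings
        System.out.println(epochKey(instant));       // identical regardless of zone
    }
}
```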
>  
> {code:java}
> /**
>  * Constructs an iterator with a pair of the record key and the secondary
>  * index value for each record in the file slice.
>  */
> private static <T> ClosableIterator<Pair<String, String>> createSecondaryIndexRecordGenerator(
>     HoodieReaderContext<T> readerContext,
>     HoodieTableMetaClient metaClient,
>     FileSlice fileSlice,
>     Schema tableSchema,
>     HoodieIndexDefinition indexDefinition,
>     String instantTime,
>     TypedProperties props,
>     boolean allowInflightInstants) throws IOException {
> while (recordIterator.hasNext()) {
>   T record = recordIterator.next();
>   Object secondaryKey = readerContext.getValue(record, tableSchema, secondaryKeyField);
>   if (secondaryKey != null) {
>     nextValidRecord = Pair.of(
>         readerContext.getRecordKey(record, tableSchema),
>         secondaryKey.toString()
>     );
>     return true;
>   }
> } {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
