Davis Zhang created HUDI-9566:
---------------------------------

             Summary: Secondary index convert everything to string
                 Key: HUDI-9566
                 URL: https://issues.apache.org/jira/browse/HUDI-9566
             Project: Apache Hudi
          Issue Type: Bug
            Reporter: Davis Zhang


When generating a secondary index (SI) record from a data column, which can be 
of any type, we convert the column value to a string via "toString". The 
record-level index (RLI) probably does something similar.
h2. Read path

So for an SI lookup, the lookup key must be converted to exactly the same 
string, even though the lookup key provider may supply keys of various data 
types. This can cause surprises.

 

For example, if an SI is built on a float column, the SI uses Float::toString 
and indexes the string literal "10.0". If the index lookup is then given a set 
of long/int keys, toString on those keys generates strings like "10".

10.0 == 10 is true, but "10.0".equals("10") is false

This means that even if the values are numerically equal, the lookup can still 
miss the value and thus cause correctness issues.
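
The mismatch described above can be reproduced with plain Java, independent of Hudi. This is a minimal sketch: the class name and both helper methods are hypothetical, and normalizedForFloatColumn only illustrates one possible mitigation (coercing the lookup key to the column's type before stringifying), not a proposed fix.

```java
import java.util.HashSet;
import java.util.Set;

public class SecondaryKeyStringMismatch {

  // Mirrors the secondaryKey.toString() call used when SI records are generated.
  static String indexedForm(Object columnValue) {
    return columnValue.toString();
  }

  // Hypothetical helper (an assumption, not Hudi code): coerce a numeric
  // lookup key to float, the indexed column's type, before stringifying it.
  static String normalizedForFloatColumn(Number lookupKey) {
    return Float.toString(lookupKey.floatValue());
  }

  public static void main(String[] args) {
    // Stand-in for the secondary index: the set of stringified column values.
    Set<String> index = new HashSet<>();
    index.add(indexedForm(10.0f)); // the float column value is stored as "10.0"

    long lookupKey = 10L;

    // The numeric comparison succeeds.
    System.out.println(10.0f == lookupKey); // true

    // The stringified long is "10", so the string lookup misses.
    System.out.println(index.contains(indexedForm(lookupKey))); // false

    // After coercing the key to the column type, the strings match.
    System.out.println(index.contains(normalizedForFloatColumn(lookupKey))); // true
  }
}
```

Coercing to the column type is only one option. It assumes the lookup side knows the indexed column's type, which the description above does not guarantee.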

 
h2. Write path

As of today, an SI update always generates records from the data column, and 
the data column has one consistent data type unless there is a schema 
evolution. As long as no schema evolution happens, it should be fine.

 
{code:java}
/**
 * Constructs an iterator with a pair of the record key and the secondary index
 * value for each record in the file slice.
 */
private static <T> ClosableIterator<Pair<String, String>> createSecondaryIndexRecordGenerator(
    HoodieReaderContext<T> readerContext,
    HoodieTableMetaClient metaClient,
    FileSlice fileSlice,
    Schema tableSchema,
    HoodieIndexDefinition indexDefinition,
    String instantTime,
    TypedProperties props,
    boolean allowInflightInstants) throws IOException {

// ...

while (recordIterator.hasNext()) {
  T record = recordIterator.next();
  Object secondaryKey = readerContext.getValue(record, tableSchema, secondaryKeyField);
  if (secondaryKey != null) {
    nextValidRecord = Pair.of(
        readerContext.getRecordKey(record, tableSchema),
        secondaryKey.toString()
    );
    return true;
  }
}
{code}
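
The schema evolution caveat for the write path can be illustrated the same way. The sketch below assumes the indexed column is widened from int to double, or from float to double (both are promotions that Avro schema resolution permits). The class and method names are hypothetical, and whether Hudi rebuilds the index after such a change is not covered here.

```java
public class SchemaEvolutionKeyDrift {

  // Mirrors the secondaryKey.toString() call in the record generator above.
  static String indexedForm(Object columnValue) {
    return columnValue.toString();
  }

  public static void main(String[] args) {
    // Before evolution: the column is int. After evolution: the same logical
    // value is read back as double.
    int beforeEvolution = 10;
    double afterEvolution = beforeEvolution;
    String oldKey = indexedForm(beforeEvolution);
    String newKey = indexedForm(afterEvolution);
    System.out.println(oldKey + " vs " + newKey);
    System.out.println(oldKey.equals(newKey)); // false: same value, different key

    // Widening float to double changes the printed digits as well, because
    // the double carries the float's binary rounding error explicitly.
    float narrow = 0.1f;
    double widened = narrow;
    System.out.println(indexedForm(narrow).equals(indexedForm(widened))); // false
  }
}
```

In both cases, index entries written before the type change would no longer match keys generated from the same logical values after it.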



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
