[
https://issues.apache.org/jira/browse/HUDI-9566?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Davis Zhang updated HUDI-9566:
------------------------------
Description:
When generating a secondary index (SI) record from a data column, which can be
of any type, we convert the value to a string via toString(). The record-level
index (RLI) probably does something similar.
h2. Read path
For an SI lookup, the lookup key must be rendered as exactly the same string
that was indexed, even though the lookup key provider may supply keys of
various data types. This can cause surprises.
For example, if an SI is built on a float column, the SI uses Float::toString
and indexes the string literal "10.0". If the index lookup then supplies a
lookup set of long/int type, toString generates strings like "10".
10.0 == 10 is true, but "10.0".equals("10") is false.
This means that even when the values are numerically equal, the lookup can
miss the value, which is a correctness issue.
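The mismatch can be reproduced outside Hudi with a minimal sketch. The class below is purely illustrative: the HashMap, buildIndex, lookup, and the record key "rk-1" are hypothetical stand-ins, not Hudi APIs. It only assumes, as described above, that both the write side and the read side stringify keys with toString().

```java
import java.util.HashMap;
import java.util.Map;

final class SecondaryIndexLookupMismatch {
    // Hypothetical stand-in for the SI: string-encoded secondary key -> record key.
    static Map<String, String> buildIndex(float columnValue, String recordKey) {
        Map<String, String> index = new HashMap<>();
        // Mirrors the write side: the typed column value is boxed and stringified.
        Object secondaryKey = columnValue;
        index.put(secondaryKey.toString(), recordKey);
        return index;
    }

    // The lookup key is stringified the same way, whatever its runtime type is.
    static String lookup(Map<String, String> index, Object lookupKey) {
        return index.get(lookupKey.toString());
    }

    public static void main(String[] args) {
        Map<String, String> index = buildIndex(10.0f, "rk-1");
        System.out.println(10.0f == 10L);          // true: numeric comparison
        System.out.println(lookup(index, 10.0f));  // rk-1: key string is "10.0"
        System.out.println(lookup(index, 10L));    // null: key string is "10"
    }
}
```

The float lookup hits because it produces the same string that was indexed, while the long lookup silently returns nothing even though the values are numerically equal.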
h2. Write path
As of today, an SI update always generates records from the data column, and
the data column has a consistent data type unless the schema evolves. As long
as no schema evolution happens, this should be fine.
With schema evolution that changes the data type (for example, int -> float),
toString starts to produce different output. If the SI previously tracked the
int 10 as the string literal "10", then after the promotion to float it tracks
the value as "10.0". Thus, even a single record might end up with 2 SI records.
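A minimal sketch of this drift, under stated assumptions: the map below is a hypothetical stand-in for SI storage, and indexRecord and simulate are illustrative names, not Hudi code. It deliberately does not model how Hudi removes stale SI entries on update; it only shows that the same record yields two different string keys before and after an int -> float change.

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

final class SchemaEvolutionIndexDrift {
    // Hypothetical stand-in for SI storage: string-encoded secondary key -> record keys.
    static void indexRecord(Map<String, Set<String>> index, Object columnValue, String recordKey) {
        index.computeIfAbsent(columnValue.toString(), k -> new TreeSet<>()).add(recordKey);
    }

    static Map<String, Set<String>> simulate() {
        Map<String, Set<String>> index = new TreeMap<>();
        indexRecord(index, 10, "rk-1");     // column type is int: key string is "10"
        indexRecord(index, 10.0f, "rk-1");  // same record after int -> float: key string is "10.0"
        return index;
    }

    public static void main(String[] args) {
        // One record, two index keys.
        System.out.println(simulate());     // prints {10=[rk-1], 10.0=[rk-1]}
    }
}
```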
{code:java}
/**
 * Constructs an iterator with a pair of the record key and the secondary index
 * value for each record in the file slice.
 */
private static <T> ClosableIterator<Pair<String, String>> createSecondaryIndexRecordGenerator(
    HoodieReaderContext<T> readerContext,
    HoodieTableMetaClient metaClient,
    FileSlice fileSlice,
    Schema tableSchema,
    HoodieIndexDefinition indexDefinition,
    String instantTime,
    TypedProperties props,
    boolean allowInflightInstants) throws IOException {
  // ... (excerpt; the rest of the method body is omitted)
  while (recordIterator.hasNext()) {
    T record = recordIterator.next();
    Object secondaryKey = readerContext.getValue(record, tableSchema, secondaryKeyField);
    if (secondaryKey != null) {
      nextValidRecord = Pair.of(
          readerContext.getRecordKey(record, tableSchema),
          // The typed column value is flattened to a string here.
          secondaryKey.toString()
      );
      return true;
    }
  }
{code}
> Secondary index convert everything to string
> --------------------------------------------
>
> Key: HUDI-9566
> URL: https://issues.apache.org/jira/browse/HUDI-9566
> Project: Apache Hudi
> Issue Type: Bug
> Reporter: Davis Zhang
> Priority: Major