[
https://issues.apache.org/jira/browse/HUDI-4459?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Leo zhang updated HUDI-4459:
----------------------------
Description:
I am trying to sync a huge table with 4000+ fields into Hudi, using a COW table
with the bulk_insert operation type.
The job finishes without any exception, but when I try to read data
from the table, I get an empty result. The parquet file is corrupted and can't be
read correctly.
I traced the problem and found it was caused by the SortOperator.
After a record is serialized in the sorter, all the fields get disordered and are
deserialized into one field. The resulting wrong record is then written into the
parquet file, making the file unreadable.
Here are a few steps to reproduce the bug in the Flink sql-client (an illustrative
sketch of the statements follows the list):
1. Execute the table DDL (provided in the table.ddl file in the attachments).
2. Execute the insert statement (provided in the statements.sql file in the
attachments).
3. Execute a select statement to query the Hudi table (provided in the
statements.sql file in the attachments).
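For reference, here is a minimal sketch of the shape of these statements, assuming
a hypothetical narrow schema (the real 4000+ field DDL is in the attached
table.ddl). The table name and columns are placeholders; the connector options
'table.type' = 'COPY_ON_WRITE' and 'write.operation' = 'bulk_insert' are the
standard Hudi Flink settings for this scenario.
{code:sql}
-- Hypothetical narrow stand-in for the attached 4000+ field table.ddl.
CREATE TABLE hudi_wide (
  id BIGINT,
  col_0 STRING,
  col_1 STRING,
  col_2 STRING,
  PRIMARY KEY (id) NOT ENFORCED
) WITH (
  'connector' = 'hudi',
  'path' = 'file:///tmp/hudi_wide',   -- placeholder path
  'table.type' = 'COPY_ON_WRITE',     -- cow table
  'write.operation' = 'bulk_insert'   -- bulk_insert operation type
);

-- Step 2: insert a row (the attached statements.sql inserts into the full schema).
INSERT INTO hudi_wide VALUES (1, 'a', 'b', 'c');

-- Step 3: query the table; with the bug, the result comes back empty because
-- the underlying parquet file is corrupt.
SELECT * FROM hudi_wide;
{code}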
> Corrupt parquet file created when syncing a huge table with 4000+ fields, using
> Hudi COW table with bulk_insert type
> -----------------------------------------------------------------------------------------------------------------
>
> Key: HUDI-4459
> URL: https://issues.apache.org/jira/browse/HUDI-4459
> Project: Apache Hudi
> Issue Type: Bug
> Reporter: Leo zhang
> Assignee: Danny Chen
> Priority: Major
> Attachments: statements.sql, table.ddl
>
--
This message was sent by Atlassian Jira
(v8.20.10#820010)