[
https://issues.apache.org/jira/browse/HUDI-4459?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Leo zhang updated HUDI-4459:
----------------------------
Description:
I am trying to sync a huge table with 4000+ fields into Hudi, using a COW table
with the bulk_insert operation type.
The job finishes without any exception, but when I try to read data
from the table, I get an empty result. The parquet file is corrupted and can't be
read correctly.
I traced the problem and found it was caused by the SortOperator.
After a record is serialized in the sorter, all the fields get disordered and are
deserialized into one field. The resulting wrong record is then written into the
parquet file, making the file unreadable.
Here are a few steps to reproduce the bug in the Flink sql-client (an illustrative
sketch of the statements follows the list):
1. Execute the table DDL (provided in the table.ddl file in the attachments).
2. Execute the insert statement (provided in the statements.sql file in the
attachments).
3. Execute a select statement to query the Hudi table (provided in the
statements.sql file in the attachments).
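For reference, here is a minimal sketch of the shape of these statements, assuming
a hypothetical narrow schema (the real 4000+ field DDL is in the attached
table.ddl). The table name and columns are placeholders; the connector options
'table.type' = 'COPY_ON_WRITE' and 'write.operation' = 'bulk_insert' are the
standard Hudi Flink settings for this scenario.
{code:sql}
-- Hypothetical narrow stand-in for the attached 4000+ field table.ddl.
CREATE TABLE hudi_wide (
  id BIGINT,
  col_0 STRING,
  col_1 STRING,
  col_2 STRING,
  PRIMARY KEY (id) NOT ENFORCED
) WITH (
  'connector' = 'hudi',
  'path' = 'file:///tmp/hudi_wide',   -- placeholder path
  'table.type' = 'COPY_ON_WRITE',     -- cow table
  'write.operation' = 'bulk_insert'   -- bulk_insert operation type
);

-- Step 2: insert a row (the attached statements.sql inserts into the full schema).
INSERT INTO hudi_wide VALUES (1, 'a', 'b', 'c');

-- Step 3: query the table; with the bug, the result comes back empty because
-- the underlying parquet file is corrupt.
SELECT * FROM hudi_wide;
{code}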
> Corrupt parquet file created when syncing a huge table with 4000+ fields, using
> Hudi COW table with bulk_insert type
> -----------------------------------------------------------------------------------------------------------------
>
> Key: HUDI-4459
> URL: https://issues.apache.org/jira/browse/HUDI-4459
> Project: Apache Hudi
> Issue Type: Bug
> Reporter: Leo zhang
> Assignee: Danny Chen
> Priority: Major
> Attachments: statements.sql, table.ddl
>
--
This message was sent by Atlassian Jira
(v8.20.10#820010)