[GitHub] [hudi] hudi-bot commented on pull request #4984: [HUDI-3583] Fix MarkerBasedRollbackStrategy NoSuchElementException
hudi-bot commented on pull request #4984:
URL: https://github.com/apache/hudi/pull/4984#issuecomment-1064867430

## CI report:

* 015f7f0e07d3f0efbd8d3a728f802fc5572a8f52 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6694)
* e30a63cc90f3afbea7ee36c37283f2f21ea7998f UNKNOWN
* c0a0e141561d1d75150aab046090e1ccd1c9e2c2 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6834)

Bot commands: @hudi-bot supports the following commands:
- `@hudi-bot run azure` re-run the last Azure build
[GitHub] [hudi] hudi-bot removed a comment on pull request #4984: [HUDI-3583] Fix MarkerBasedRollbackStrategy NoSuchElementException
hudi-bot removed a comment on pull request #4984:
URL: https://github.com/apache/hudi/pull/4984#issuecomment-1064865517

## CI report:

* 015f7f0e07d3f0efbd8d3a728f802fc5572a8f52 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6694)
* e30a63cc90f3afbea7ee36c37283f2f21ea7998f UNKNOWN
* c0a0e141561d1d75150aab046090e1ccd1c9e2c2 UNKNOWN
[GitHub] [hudi] hudi-bot commented on pull request #4925: [HUDI-3103] Enable MultiTableDeltaStreamer to update a single sink table from multiple source tables
hudi-bot commented on pull request #4925:
URL: https://github.com/apache/hudi/pull/4925#issuecomment-1064873216

## CI report:

* 018bb851445f7eabaa0bd4cc2b362f269d6fec59 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6436)
* 7119319af35fb23afa97e058cd2fbfaea18292a1 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6831)
* 82acc7daea301fa4e373c1ff2570a60ca1da6ce3 UNKNOWN
[GitHub] [hudi] hudi-bot removed a comment on pull request #4925: [HUDI-3103] Enable MultiTableDeltaStreamer to update a single sink table from multiple source tables
hudi-bot removed a comment on pull request #4925:
URL: https://github.com/apache/hudi/pull/4925#issuecomment-1064843916

## CI report:

* 018bb851445f7eabaa0bd4cc2b362f269d6fec59 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6436)
* 7119319af35fb23afa97e058cd2fbfaea18292a1 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6831)
[GitHub] [hudi] hudi-bot removed a comment on pull request #5017: [HUDI-3606] Add `org.objenesis:objenesis` to hudi-timeline-server-bundle pom
hudi-bot removed a comment on pull request #5017:
URL: https://github.com/apache/hudi/pull/5017#issuecomment-1064800761

## CI report:

* d1211dd592bcb9e3df60b80b9585d2eda9f0b8ab Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6823)
[GitHub] [hudi] hudi-bot commented on pull request #5017: [HUDI-3606] Add `org.objenesis:objenesis` to hudi-timeline-server-bundle pom
hudi-bot commented on pull request #5017:
URL: https://github.com/apache/hudi/pull/5017#issuecomment-1064873400

## CI report:

* d1211dd592bcb9e3df60b80b9585d2eda9f0b8ab Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6823)
[GitHub] [hudi] hudi-bot removed a comment on pull request #4925: [HUDI-3103] Enable MultiTableDeltaStreamer to update a single sink table from multiple source tables
hudi-bot removed a comment on pull request #4925:
URL: https://github.com/apache/hudi/pull/4925#issuecomment-1064873216

## CI report:

* 018bb851445f7eabaa0bd4cc2b362f269d6fec59 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6436)
* 7119319af35fb23afa97e058cd2fbfaea18292a1 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6831)
* 82acc7daea301fa4e373c1ff2570a60ca1da6ce3 UNKNOWN
[GitHub] [hudi] hudi-bot commented on pull request #4925: [HUDI-3103] Enable MultiTableDeltaStreamer to update a single sink table from multiple source tables
hudi-bot commented on pull request #4925:
URL: https://github.com/apache/hudi/pull/4925#issuecomment-1064875199

## CI report:

* 7119319af35fb23afa97e058cd2fbfaea18292a1 Azure: [CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6831)
* 82acc7daea301fa4e373c1ff2570a60ca1da6ce3 UNKNOWN
[GitHub] [hudi] hudi-bot commented on pull request #4925: [HUDI-3103] Enable MultiTableDeltaStreamer to update a single sink table from multiple source tables
hudi-bot commented on pull request #4925:
URL: https://github.com/apache/hudi/pull/4925#issuecomment-1064883259

## CI report:

* 7119319af35fb23afa97e058cd2fbfaea18292a1 Azure: [CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6831)
* 82acc7daea301fa4e373c1ff2570a60ca1da6ce3 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6835)
[GitHub] [hudi] hudi-bot removed a comment on pull request #4925: [HUDI-3103] Enable MultiTableDeltaStreamer to update a single sink table from multiple source tables
hudi-bot removed a comment on pull request #4925:
URL: https://github.com/apache/hudi/pull/4925#issuecomment-1064875199

## CI report:

* 7119319af35fb23afa97e058cd2fbfaea18292a1 Azure: [CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6831)
* 82acc7daea301fa4e373c1ff2570a60ca1da6ce3 UNKNOWN
[GitHub] [hudi] hudi-bot commented on pull request #4489: [HUDI-3135] Fix Delete partitions with metadata table and fix show partitions in spark sql
hudi-bot commented on pull request #4489:
URL: https://github.com/apache/hudi/pull/4489#issuecomment-1064900633

## CI report:

* d17343318be38b5a9b0953004700aa72f4fed689 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6809)
* 44942ace20195bb284b5ce7e792462865255f2e0 UNKNOWN
[GitHub] [hudi] hudi-bot removed a comment on pull request #4489: [HUDI-3135] Fix Delete partitions with metadata table and fix show partitions in spark sql
hudi-bot removed a comment on pull request #4489:
URL: https://github.com/apache/hudi/pull/4489#issuecomment-1064751722

## CI report:

* d17343318be38b5a9b0953004700aa72f4fed689 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6809)
[GitHub] [hudi] hudi-bot commented on pull request #5018: [HUDI-3559] fix flink Bucket Index with COW table type `NoSuchElementException` cause of deduplicateRecords method in FlinkWriteHelper out of
hudi-bot commented on pull request #5018:
URL: https://github.com/apache/hudi/pull/5018#issuecomment-1064901069

## CI report:

* b9e437b2c2942ba29945d1d21c7e214e350e4333 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6825)
[GitHub] [hudi] hudi-bot removed a comment on pull request #5018: [HUDI-3559] fix flink Bucket Index with COW table type `NoSuchElementException` cause of deduplicateRecords method in FlinkWriteHelper
hudi-bot removed a comment on pull request #5018:
URL: https://github.com/apache/hudi/pull/5018#issuecomment-1064807946

## CI report:

* b9e437b2c2942ba29945d1d21c7e214e350e4333 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6825)
[GitHub] [hudi] hudi-bot commented on pull request #4489: [HUDI-3135] Fix Delete partitions with metadata table and fix show partitions in spark sql
hudi-bot commented on pull request #4489:
URL: https://github.com/apache/hudi/pull/4489#issuecomment-1064902635

## CI report:

* d17343318be38b5a9b0953004700aa72f4fed689 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6809)
* 44942ace20195bb284b5ce7e792462865255f2e0 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6836)
[GitHub] [hudi] hudi-bot removed a comment on pull request #4489: [HUDI-3135] Fix Delete partitions with metadata table and fix show partitions in spark sql
hudi-bot removed a comment on pull request #4489:
URL: https://github.com/apache/hudi/pull/4489#issuecomment-1064900633

## CI report:

* d17343318be38b5a9b0953004700aa72f4fed689 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6809)
* 44942ace20195bb284b5ce7e792462865255f2e0 UNKNOWN
[GitHub] [hudi] nloneday commented on issue #4943: [SUPPORT] NoClassDefFoundError: org/apache/hudi/org/apache/hadoop/hive/metastore/api/NoSuchObjectException
nloneday commented on issue #4943:
URL: https://github.com/apache/hudi/issues/4943#issuecomment-1064910256

> @danny0405 remove the hive shade pattern in the hudi flink bundle jar? How can I fix it now?
>
> You may take a reference of this pom: https://github.com/apache/hudi/blob/master/packaging/hudi-flink-bundle/pom.xml

Removing the hive shade pattern in the hudi flink bundle jar doesn't work. The simplest way to resolve this problem is to use the https://github.com/apache/hudi/blob/master/packaging/hudi-flink-bundle/pom.xml pom file and just downgrade the project version to 0.10.1.
[GitHub] [hudi] hudi-bot removed a comment on pull request #4971: [HUDI-3556] Re-use rollback instant for rolling back of clustering and compaction if rollback failed mid-way
hudi-bot removed a comment on pull request #4971:
URL: https://github.com/apache/hudi/pull/4971#issuecomment-1064815294

## CI report:

* 74ace6ca3f717a41d54047bb44ea52fedb94e1ce Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6810)
* 7367ebfc60119b4442988ebc7350e4daac15b65f Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6826)
[GitHub] [hudi] hudi-bot commented on pull request #4971: [HUDI-3556] Re-use rollback instant for rolling back of clustering and compaction if rollback failed mid-way
hudi-bot commented on pull request #4971:
URL: https://github.com/apache/hudi/pull/4971#issuecomment-1064923276

## CI report:

* 7367ebfc60119b4442988ebc7350e4daac15b65f Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6826)
[jira] [Created] (HUDI-3608) Support Tinyint and small int data types
Danny Chen created HUDI-3608:

Summary: Support Tinyint and small int data types
Key: HUDI-3608
URL: https://issues.apache.org/jira/browse/HUDI-3608
Project: Apache Hudi
Issue Type: Task
Components: core
Reporter: Danny Chen
Fix For: 0.11.0
[GitHub] [hudi] danny0405 commented on issue #4998: [SUPPORT] Tinyint and small int data types
danny0405 commented on issue #4998:
URL: https://github.com/apache/hudi/issues/4998#issuecomment-1064924607

issue created: https://issues.apache.org/jira/browse/HUDI-3608
[GitHub] [hudi] danny0405 closed issue #4998: [SUPPORT] Tinyint and small int data types
danny0405 closed issue #4998:
URL: https://github.com/apache/hudi/issues/4998
[jira] [Updated] (HUDI-3608) Support Tinyint and small int data types
[ https://issues.apache.org/jira/browse/HUDI-3608?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Danny Chen updated HUDI-3608:
Description: Currently, the Avro schema does not support the tinyint and smallint data types, but Hudi uses the Avro schema as a bridge between the user DDL schema and the Parquet schema; we should fix that.

> Support Tinyint and small int data types
>
> Key: HUDI-3608
> URL: https://issues.apache.org/jira/browse/HUDI-3608
> Project: Apache Hudi
> Issue Type: Task
> Components: core
> Reporter: Danny Chen
> Priority: Major
> Fix For: 0.11.0
>
> Currently, the Avro schema does not support the tinyint and smallint data types, but Hudi uses the Avro schema as a bridge between the user DDL schema and the Parquet schema; we should fix that.
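For context, a brief sketch of the constraint the ticket describes, using the plain Avro `SchemaBuilder` API; the record and field names below are illustrative. Avro has no 8- or 16-bit integer type, so a TINYINT or SMALLINT column must be widened (typically to `int`) on its way through the Avro bridge to Parquet:

```java
import org.apache.avro.Schema;
import org.apache.avro.SchemaBuilder;

public class TinyIntWidening {
  public static void main(String[] args) {
    // Avro primitives are null/boolean/int/long/float/double/bytes/string:
    // there is no TINYINT or SMALLINT, so both columns end up as "int".
    Schema rowSchema = SchemaBuilder.record("Row").fields()
        .name("tiny_col").type().intType().noDefault()   // originally TINYINT in the DDL
        .name("small_col").type().intType().noDefault()  // originally SMALLINT in the DDL
        .endRecord();
    System.out.println(rowSchema.toString(true));
  }
}
```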
[GitHub] [hudi] danny0405 opened a new issue #5020: [SUPPORT] The cleaning strategy breaks the reader view completeness
danny0405 opened a new issue #5020:
URL: https://github.com/apache/hudi/issues/5020

Currently we have several cleaning strategies, such as `num_commits`, `delta hours`, and `num_versions`. Let's say a user uses the `num_commits` strategy with the following params (mapped to writer configs in the sketch below):

- max 10 commits to archive
- min 4 commits to keep alive
- 6 commits to clean

c1 c2 c3 c4 c5 c6 c7 c8 c9 c10

At c10, the reader starts reading the latest fs view with a file slice that was written in c1:

/+ --- fg1_c1.parquet

The cleaner also starts working at c10. It finds that the number of commits > 6 (10 > 6), so all the files committed in c1 ~ c4 are deleted, and the reader throws `FileNotFoundException`.

This problem is common and occurs frequently, especially in streaming read mode (it also happens when a batch read job is complex and runs for a long time). We need some mechanism to ensure the semantic integrity of the read view.
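For reference, a minimal sketch mapping the parameters above to writer configs, using the builder API from roughly this era of Hudi; the exact placement of the cleaning options varies across versions, so treat this as an assumption rather than the definitive API:

```java
import org.apache.hudi.common.model.HoodieCleaningPolicy;
import org.apache.hudi.config.HoodieCompactionConfig;
import org.apache.hudi.config.HoodieWriteConfig;

public class CleanerRetentionExample {
  public static HoodieWriteConfig buildConfig(String basePath) {
    // Mirrors the scenario in the issue: clean file slices older than the
    // last 6 commits, archive while keeping 4-10 commits on the timeline.
    return HoodieWriteConfig.newBuilder()
        .withPath(basePath)
        .withCompactionConfig(HoodieCompactionConfig.newBuilder()
            .withCleanerPolicy(HoodieCleaningPolicy.KEEP_LATEST_COMMITS)
            .retainCommits(6)           // hoodie.cleaner.commits.retained
            .archiveCommitsWith(4, 10)  // hoodie.keep.min.commits / hoodie.keep.max.commits
            .build())
        .build();
  }
}
```

A long-running reader that needs a file slice older than the retained window is exactly the case the issue describes: nothing ties the cleaner's horizon to in-flight readers.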
[GitHub] [hudi] prashantwason commented on a change in pull request #4693: [WIP][HUDI-3175][RFC-45] Implement async metadata indexing
prashantwason commented on a change in pull request #4693:
URL: https://github.com/apache/hudi/pull/4693#discussion_r824490905

## File path: hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/BaseHoodieWriteClient.java

```diff
@@ -855,6 +856,21 @@ public boolean scheduleCompactionAtInstant(String instantTime, Option
+  public Option<String> scheduleIndexing(List partitions) {
+    String instantTime = HoodieActiveTimeline.createNewInstantTime();
+    return scheduleIndexingAtInstant(partitions, instantTime) ? Option.of(instantTime) : Option.empty();
+  }
+
+  private boolean scheduleIndexingAtInstant(List partitionsToIndex, String instantTime) throws HoodieIOException {
```

Review comment: This being a private function only called from the function above, why not merge it into scheduleIndexing?

## File path: hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/index/RunIndexActionExecutor.java

```diff
@@ -0,0 +1,185 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *      http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.hudi.table.action.index;
+
+import org.apache.hudi.avro.model.HoodieCleanMetadata;
+import org.apache.hudi.avro.model.HoodieIndexCommitMetadata;
+import org.apache.hudi.avro.model.HoodieIndexPartitionInfo;
+import org.apache.hudi.avro.model.HoodieIndexPlan;
+import org.apache.hudi.avro.model.HoodieRestoreMetadata;
+import org.apache.hudi.avro.model.HoodieRollbackMetadata;
+import org.apache.hudi.common.engine.HoodieEngineContext;
+import org.apache.hudi.common.model.HoodieCommitMetadata;
+import org.apache.hudi.common.model.HoodieRecordPayload;
+import org.apache.hudi.common.table.HoodieTableMetaClient;
+import org.apache.hudi.common.table.timeline.HoodieActiveTimeline;
+import org.apache.hudi.common.table.timeline.HoodieInstant;
+import org.apache.hudi.common.table.timeline.HoodieTimeline;
+import org.apache.hudi.common.table.timeline.TimelineMetadataUtils;
+import org.apache.hudi.common.util.CleanerUtils;
+import org.apache.hudi.common.util.HoodieTimer;
+import org.apache.hudi.common.util.Option;
+import org.apache.hudi.common.util.ValidationUtils;
+import org.apache.hudi.config.HoodieWriteConfig;
+import org.apache.hudi.exception.HoodieIndexException;
+import org.apache.hudi.metadata.HoodieTableMetadata;
+import org.apache.hudi.metadata.HoodieTableMetadataWriter;
+import org.apache.hudi.table.HoodieTable;
+import org.apache.hudi.table.action.BaseActionExecutor;
+
+import org.apache.log4j.LogManager;
+import org.apache.log4j.Logger;
+
+import java.io.IOException;
+import java.util.List;
+import java.util.Set;
+import java.util.concurrent.ExecutionException;
+import java.util.concurrent.ExecutorService;
+import java.util.concurrent.Executors;
+import java.util.concurrent.Future;
+import java.util.concurrent.TimeUnit;
+import java.util.concurrent.TimeoutException;
+import java.util.stream.Collectors;
+import java.util.stream.Stream;
+
+public class RunIndexActionExecutor<T extends HoodieRecordPayload, I, K, O> extends BaseActionExecutor<T, I, K, O, Option<HoodieIndexCommitMetadata>> {
+
+  private static final Logger LOG = LogManager.getLogger(RunIndexActionExecutor.class);
+  private static final Integer INDEX_COMMIT_METADATA_VERSION_1 = 1;
+  private static final Integer LATEST_INDEX_COMMIT_METADATA_VERSION = INDEX_COMMIT_METADATA_VERSION_1;
+  private static final int MAX_CONCURRENT_INDEXING = 1;
+
+  public RunIndexActionExecutor(HoodieEngineContext context, HoodieWriteConfig config, HoodieTable<T, I, K, O> table, String instantTime) {
+    super(context, config, table, instantTime);
+  }
+
+  @Override
+  public Option<HoodieIndexCommitMetadata> execute() {
+    HoodieTimer indexTimer = new HoodieTimer();
+    indexTimer.startTimer();
+
+    HoodieInstant indexInstant = table.getActiveTimeline()
+        .filterPendingIndexTimeline()
+        .filter(instant -> instant.getTimestamp().equals(instantTime))
+        .lastInstant()
+        .orElseThrow(() -> new HoodieIndexException(String.format("No pending index instant found: %s", instantTime)));
+    ValidationUtils.checkArgument(HoodieInstant.State.INFLIGHT.equals(indexInstant.getState()),
+        String.format("Index instant %s already inflight", instantTime));
+    try {
+      // read HoodieIndexPlan assuming indexInstant is requested
+      // TODO: handle inflight i
```
[GitHub] [hudi] hudi-bot removed a comment on pull request #4982: [HUDI-3567] Refactor HoodieCommonUtils to make code more reasonable
hudi-bot removed a comment on pull request #4982:
URL: https://github.com/apache/hudi/pull/4982#issuecomment-1064820630

## CI report:

* 282ca401f8e2a93d7703f592041b854959291d41 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6805) Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6817) Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6827)
[GitHub] [hudi] hudi-bot commented on pull request #4982: [HUDI-3567] Refactor HoodieCommonUtils to make code more reasonable
hudi-bot commented on pull request #4982:
URL: https://github.com/apache/hudi/pull/4982#issuecomment-1064953069

## CI report:

* 282ca401f8e2a93d7703f592041b854959291d41 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6805) Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6817) Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6827)
[GitHub] [hudi] prashantwason commented on a change in pull request #4693: [WIP][HUDI-3175][RFC-45] Implement async metadata indexing
prashantwason commented on a change in pull request #4693:
URL: https://github.com/apache/hudi/pull/4693#discussion_r824571450

## File path: hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metadata/HoodieBackedTableMetadataWriter.java

```diff
@@ -118,18 +124,18 @@
 /**
  * Hudi backed table metadata writer.
  *
- * @param hadoopConf - Hadoop configuration to use for the metadata writer
- * @param writeConfig - Writer config
- * @param engineContext - Engine context
- * @param actionMetadata - Optional action metadata to help decide bootstrap operations
- * @param - Action metadata types extending Avro generated SpecificRecordBase
+ * @param hadoopConf - Hadoop configuration to use for the metadata writer
+ * @param writeConfig - Writer config
+ * @param engineContext - Engine context
+ * @param actionMetadata - Optional action metadata to help decide bootstrap operations
+ * @param - Action metadata types extending Avro generated SpecificRecordBase
  * @param inflightInstantTimestamp - Timestamp of any instant in progress
  */
 protected HoodieBackedTableMetadataWriter(Configuration hadoopConf,
-                                          HoodieWriteConfig writeConfig,
-                                          HoodieEngineContext engineContext,
-                                          Option actionMetadata,
-                                          Option inflightInstantTimestamp) {
+                                HoodieWriteConfig writeConfig,
```

Review comment: +1, as we lose the commit history of these lines and it bloats the diffs.
[GitHub] [hudi] xushiyan commented on a change in pull request #4962: [HUDI-3355] Issue with out of order commits in the timeline when ingestion writers using SparkAllowUpdateStrategy
xushiyan commented on a change in pull request #4962:
URL: https://github.com/apache/hudi/pull/4962#discussion_r824584368

## File path: hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/BaseHoodieWriteClient.java

```diff
@@ -120,6 +121,7 @@
   protected transient AsyncArchiveService asyncArchiveService;
   protected final TransactionManager txnManager;
   protected Option<Pair<HoodieInstant, Map<String, String>>> lastCompletedTxnAndMetadata = Option.empty();
+  protected List unCheckedPendingClusteringInstants = new ArrayList<>();
```

Review comment: also applies to other occurrences

```suggestion
  protected List uncheckedPendingClusteringInstants = new ArrayList<>();
```
[GitHub] [hudi] wangxianghu commented on pull request #4987: [HUDI-3547] Introduce MaxwellSourcePostProcessor to extract data from Maxwell json string
wangxianghu commented on pull request #4987:
URL: https://github.com/apache/hudi/pull/4987#issuecomment-1064986515

@hudi-bot run azure
[GitHub] [hudi] hudi-bot removed a comment on pull request #4987: [HUDI-3547] Introduce MaxwellSourcePostProcessor to extract data from Maxwell json string
hudi-bot removed a comment on pull request #4987:
URL: https://github.com/apache/hudi/pull/4987#issuecomment-1063269660

## CI report:

* 84d9028db0242b31b9fcee5cc27a361bd3c987ae UNKNOWN
* 1077705483682eca8c063671fcacdf73740dacdb Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6753)
[GitHub] [hudi] hudi-bot commented on pull request #4987: [HUDI-3547] Introduce MaxwellSourcePostProcessor to extract data from Maxwell json string
hudi-bot commented on pull request #4987:
URL: https://github.com/apache/hudi/pull/4987#issuecomment-1064987781

## CI report:

* 84d9028db0242b31b9fcee5cc27a361bd3c987ae UNKNOWN
* 1077705483682eca8c063671fcacdf73740dacdb Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6753) Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6837)
[GitHub] [hudi] hudi-bot commented on pull request #4264: [HUDI-2875] Make HoodieParquetWriter Thread safe and memory executor …
hudi-bot commented on pull request #4264:
URL: https://github.com/apache/hudi/pull/4264#issuecomment-1064991300

## CI report:

* 4a3662cb03d0fbf4f5041b9b27eebd03cd132783 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6829)
[GitHub] [hudi] hudi-bot removed a comment on pull request #4264: [HUDI-2875] Make HoodieParquetWriter Thread safe and memory executor …
hudi-bot removed a comment on pull request #4264:
URL: https://github.com/apache/hudi/pull/4264#issuecomment-1064838694

## CI report:

* 4a9c78781cc4efcf3f13d6f12836b6fc3e738878 Azure: [CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6824)
* 4a3662cb03d0fbf4f5041b9b27eebd03cd132783 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6829)
[GitHub] [hudi] leesf merged pull request #4926: [HUDI-3566]add thread factory in BoundedInMemoryExecutor
leesf merged pull request #4926:
URL: https://github.com/apache/hudi/pull/4926
[hudi] branch master updated (18cdad9 -> faed699)
This is an automated email from the ASF dual-hosted git repository. leesf pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/hudi.git.

from 18cdad9  [HUDI-2999] [RFC-42] RFC for consistent hashing index (#4326)
add  faed699  [HUDI-3566] Add thread factory in BoundedInMemoryExecutor (#4926)

No new revisions were added by this update.

Summary of changes:
 .../hudi/common/util/CustomizedThreadFactory.java  | 56 +++
 .../common/util/queue/BoundedInMemoryExecutor.java | 32 +
 .../common/util/TestCustomizedThreadFactory.java   | 79 ++
 3 files changed, 154 insertions(+), 13 deletions(-)
 create mode 100644 hudi-common/src/main/java/org/apache/hudi/common/util/CustomizedThreadFactory.java
 create mode 100644 hudi-common/src/test/java/org/apache/hudi/common/util/TestCustomizedThreadFactory.java
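For readers unfamiliar with the pattern, a minimal sketch of what such a customized thread factory typically looks like; this is illustrative only, and the actual `CustomizedThreadFactory` added by the commit may differ:

```java
import java.util.concurrent.ThreadFactory;
import java.util.concurrent.atomic.AtomicLong;

public class NamedThreadFactory implements ThreadFactory {
  private final String prefix;
  private final boolean daemon;
  private final AtomicLong counter = new AtomicLong();

  public NamedThreadFactory(String prefix, boolean daemon) {
    this.prefix = prefix;
    this.daemon = daemon;
  }

  @Override
  public Thread newThread(Runnable runnable) {
    // Named threads make the executor's workers identifiable in thread dumps,
    // instead of the anonymous "pool-N-thread-M" from the default factory.
    Thread t = new Thread(runnable, prefix + "-" + counter.getAndIncrement());
    t.setDaemon(daemon);
    return t;
  }
}
```

Passing such a factory to the executor (here, `BoundedInMemoryExecutor`) is the usual way to control thread naming and daemon status without changing the executor itself.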
[GitHub] [hudi] hudi-bot removed a comment on pull request #5019: [HUDI-3575] Use HoodieTestDataGenerator#TRIP_SCHEMA as example schema in TestSchemaPostProcessor
hudi-bot removed a comment on pull request #5019:
URL: https://github.com/apache/hudi/pull/5019#issuecomment-1064842384

## CI report:

* 3b6b326bb3650689e8ad78504ccaca3df2700998 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6830)
[GitHub] [hudi] hudi-bot commented on pull request #5019: [HUDI-3575] Use HoodieTestDataGenerator#TRIP_SCHEMA as example schema in TestSchemaPostProcessor
hudi-bot commented on pull request #5019:
URL: https://github.com/apache/hudi/pull/5019#issuecomment-1065006472

## CI report:

* 3b6b326bb3650689e8ad78504ccaca3df2700998 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6830)
[GitHub] [hudi] wangxianghu merged pull request #5019: [HUDI-3575] Use HoodieTestDataGenerator#TRIP_SCHEMA as example schema in TestSchemaPostProcessor
wangxianghu merged pull request #5019:
URL: https://github.com/apache/hudi/pull/5019
[hudi] branch master updated (faed699 -> b001803)
This is an automated email from the ASF dual-hosted git repository. wangxianghu pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/hudi.git.

from faed699  [HUDI-3566] Add thread factory in BoundedInMemoryExecutor (#4926)
add  b001803  [HUDI-3575] Use HoodieTestDataGenerator#TRIP_SCHEMA as example schema in TestSchemaPostProcessor (#5019)

No new revisions were added by this update.

Summary of changes:
 .../hudi/utilities/TestSchemaPostProcessor.java | 25 --
 1 file changed, 14 insertions(+), 11 deletions(-)
[jira] [Closed] (HUDI-3575) Use HoodieTestDataGenerator#TRIP_SCHEMA as example schema in TestSchemaPostProcessor
[ https://issues.apache.org/jira/browse/HUDI-3575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Xianghu Wang closed HUDI-3575.
Resolution: Fixed

> Use HoodieTestDataGenerator#TRIP_SCHEMA as example schema in TestSchemaPostProcessor
>
> Key: HUDI-3575
> URL: https://issues.apache.org/jira/browse/HUDI-3575
> Project: Apache Hudi
> Issue Type: Improvement
> Components: deltastreamer
> Reporter: Xianghu Wang
> Assignee: Xianghu Wang
> Priority: Major
> Labels: pull-request-available
> Fix For: 0.11.0
[GitHub] [hudi] codope commented on a change in pull request #4640: [HUDI-3225] [RFC-45] for async metadata indexing
codope commented on a change in pull request #4640:
URL: https://github.com/apache/hudi/pull/4640#discussion_r824623608

## File path: rfc/rfc-45/rfc-45.md

```md
@@ -0,0 +1,264 @@

# RFC-45: Asynchronous Metadata Indexing

## Proposers

- @codope
- @manojpec

## Approvers

- @nsivabalan
- @vinothchandar

## Status

JIRA: [HUDI-2488](https://issues.apache.org/jira/browse/HUDI-2488)

## Abstract

Metadata indexing (aka metadata bootstrapping) is the process of creation of one or more metadata-based indexes, e.g. data partitions to files index, that is stored in Hudi metadata table. Currently, the metadata table (referred as MDT hereafter) supports single partition which is created synchronously with the corresponding data table, i.e. commits are first applied to metadata table followed by data table. Our goal for MDT is to support multiple partitions to boost the performance of existing index and records lookup. However, the synchronous manner of metadata indexing is not very scalable as we add more partitions to the MDT because the regular writers (writing to the data table) have to wait until the MDT commit completes. In this RFC, we propose a design to support asynchronous metadata indexing.

## Background

We can read more about the MDT design in [RFC-15](https://cwiki.apache.org/confluence/display/HUDI/RFC+-+15%3A+HUDI+File+Listing+Improvements). Here is a quick summary of the current state (Hudi v0.10.1). MDT is an internal Merge-on-Read (MOR) table that has a single partition called `files` which stores the data partitions to files index that is used in file listing. MDT is co-located with the data table (inside `.hoodie/metadata` directory under the basepath). In order to handle multi-writer scenario, users configure lock provider and only one writer can access MDT in read-write mode. Hence, any write to MDT is guarded by the data table lock. This ensures only one write is committed to MDT at any point in time and thus guarantees serializability. However, locking overhead adversely affects the write throughput and will reach its scalability limits as we add more partitions to the MDT.

## Goals

- Support indexing one or more partitions in MDT while regular writers and table services (such as cleaning or compaction) are in progress.
- Locking to be as lightweight as possible.
- Keep required config changes to a minimum to simplify deployment / upgrade in production.
- Do not require specific ordering of how writers and table service pipelines need to be upgraded / restarted.
- If an external long-running process is being used to initialize the index, the process should be made idempotent so it can handle errors from previous runs.
- To re-initialize the index, make it as simple as running the external initialization process again without having to change configs.

## Implementation

### A new Hudi action: INDEX

We introduce a new action `index` which will denote the index building process, the mechanics of which is as follows:

1. From an external process, users can issue a CREATE INDEX or similar statement to trigger indexing for an existing table.
   1. This will schedule INDEX action and add a `.index.requested` to the timeline, which contains the indexing plan. Index scheduling will also initialize the filegroup for the partitions for which indexing is planned.
   2. From here on, the index building process will continue to build an index up to instant time `t`, where `t` is the latest completed instant time on the timeline without any "holes" i.e. no pending async operations prior to it.
   3. The indexing process will write these out as base files within the corresponding metadata partition. A metadata partition cannot be used if there is any pending indexing action against it. As and when indexing is completed for a partition, then table config (`hoodie.properties`) will be updated to indicate that partition is available for reads or synchronous updates. Hudi table config will be the source of truth for the current state of metadata index.
2. Any inflight writers (i.e. with instant time `t'` > `t`) will check for any new indexing request on the timeline prior to preparing to commit.
   1. Such writers will proceed to additionally add log entries corresponding to each such indexing request into the metadata partition.
   2. There is always a TOCTOU issue here, where the inflight writer may not see an indexing request that was just added and proceed to commit without that. We will correct this during indexing action completion. In the average case, this may not happen and the design has liveness.
3. When the indexing process is about to complete (i.e. indexing upto instant `t` is done but before completing indexing commit), it will ch
```
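A minimal sketch of the pre-commit check described in step 2, reusing the `filterPendingIndexTimeline()` call that appears in the RunIndexActionExecutor diff earlier in this digest; everything else here (class name, method, generics) is an assumption, since this API was still under review in the PR:

```java
import java.util.List;
import java.util.stream.Collectors;

import org.apache.hudi.common.table.timeline.HoodieInstant;
import org.apache.hudi.table.HoodieTable;

public class PendingIndexCheck {
  // Before committing at t' > t, an inflight writer looks for INDEX actions
  // scheduled on the timeline, so it can also log its changes into the
  // metadata partitions that are being built.
  public static List<HoodieInstant> pendingIndexes(HoodieTable<?, ?, ?, ?> table) {
    return table.getActiveTimeline()
        .filterPendingIndexTimeline()
        .getInstants()
        .collect(Collectors.toList());
  }
}
```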
[GitHub] [hudi] hudi-bot commented on pull request #4872: [HUDI-3475] Support run compaction / clustering job in Service
hudi-bot commented on pull request #4872:
URL: https://github.com/apache/hudi/pull/4872#issuecomment-1065023815

## CI report:

* c662e400cd71c1dbba9b4f37512ca5e748736f03 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6833)
[GitHub] [hudi] hudi-bot removed a comment on pull request #4872: [HUDI-3475] Support run compaction / clustering job in Service
hudi-bot removed a comment on pull request #4872:
URL: https://github.com/apache/hudi/pull/4872#issuecomment-1064853531

## CI report:

* 0fd561ae050f39c022862eae351c73b323a61e05 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6209)
* c662e400cd71c1dbba9b4f37512ca5e748736f03 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6833)
[GitHub] [hudi] xushiyan commented on a change in pull request #4175: [HUDI-2883] Refactor hive sync tool / config to use reflection and standardize configs
xushiyan commented on a change in pull request #4175:
URL: https://github.com/apache/hudi/pull/4175#discussion_r824606178

## File path: hudi-sync/hudi-hive-sync/src/test/java/org/apache/hudi/hive/testutils/HiveTestUtil.java

```diff
@@ -89,16 +89,21 @@
 @SuppressWarnings("SameParameterValue")
 public class HiveTestUtil {
+  public static final String DB_NAME = "testdb";
+  public static String TABLE_NAME = "test1";
```

Review comment: TABLE_NAME also final?

## File path: hudi-sync/hudi-hive-sync/src/test/java/org/apache/hudi/hive/testutils/HiveTestUtil.java

```diff
@@ -112,16 +117,21 @@ public static void setUp() throws IOException, InterruptedException, HiveExcepti
     }
     fileSystem = FileSystem.get(configuration);
-    hiveSyncConfig = new HiveSyncConfig();
-    hiveSyncConfig.jdbcUrl = hiveTestService.getJdbcHive2Url();
-    hiveSyncConfig.hiveUser = "";
-    hiveSyncConfig.hivePass = "";
-    hiveSyncConfig.databaseName = "testdb";
-    hiveSyncConfig.tableName = "test1";
-    hiveSyncConfig.basePath = Files.createTempDirectory("hivesynctest" + Instant.now().toEpochMilli()).toUri().toString();
-    hiveSyncConfig.assumeDatePartitioning = true;
-    hiveSyncConfig.usePreApacheInputFormat = false;
-    hiveSyncConfig.partitionFields = Collections.singletonList("datestr");
+    basePath = Files.createTempDirectory("hivesynctest" + Instant.now().toEpochMilli()).toUri().toString();
+
+    hiveSyncProps = new TypedProperties();
+    hiveSyncProps.setProperty(HiveSyncConfig.HIVE_URL.key(), hiveTestService.getJdbcHive2Url());
+    hiveSyncProps.setProperty(HiveSyncConfig.HIVE_USER.key(), "");
+    hiveSyncProps.setProperty(HiveSyncConfig.HIVE_PASS.key(), "");
+    hiveSyncProps.setProperty(HiveSyncConfig.META_SYNC_DATABASE_NAME.key(), DB_NAME);
+    hiveSyncProps.setProperty(HiveSyncConfig.META_SYNC_TABLE_NAME.key(), TABLE_NAME);
+    hiveSyncProps.setProperty(HiveSyncConfig.META_SYNC_BASE_PATH, basePath);
+    hiveSyncProps.setProperty(HiveSyncConfig.META_SYNC_ASSUME_DATE_PARTITION.key(), "true");
+    hiveSyncProps.setProperty(HiveSyncConfig.HIVE_USE_PRE_APACHE_INPUT_FORMAT.key(), "false");
+    hiveSyncProps.setProperty(HiveSyncConfig.META_SYNC_PARTITION_FIELDS.key(), "datestr");
+    hiveSyncProps.setProperty(HiveSyncConfig.HIVE_BATCH_SYNC_PARTITION_NUM.key(), "3");
+
+    hiveSyncConfig = new HiveSyncConfig(hiveSyncProps);
```

Review comment: i'd prefer to have builder pattern to reliably construct `HiveSyncConfig` instead of passing uncontrollable "raw" props. (See the builder sketch after this message.)

## File path: hudi-sync/hudi-sync-common/src/test/java/org/apache/hudi/sync/common/util/TestSyncUtilHelpers.java

```diff
@@ -0,0 +1,121 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.sync.common.util;
+
+import org.apache.hudi.common.config.TypedProperties;
+import org.apache.hudi.exception.HoodieException;
+import org.apache.hudi.sync.common.AbstractSyncTool;
+
+import java.io.IOException;
+import java.util.Properties;
+
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.fs.FileSystem;
+import org.junit.jupiter.api.BeforeEach;
+import org.junit.jupiter.api.Test;
+
+import static org.junit.jupiter.api.Assertions.assertThrows;
+import static org.junit.jupiter.api.Assertions.assertTrue;
+
+public class TestSyncUtilHelpers {
+  private static final String BASE_PATH = "/tmp/test";
+  private static final String BASE_FORMAT = "PARQUET";
+
+  private static Configuration configuration;
+  private static FileSystem fileSystem;
```

Review comment:

```suggestion
  private Configuration configuration;
  private FileSystem fileSystem;
```

static vars could run into issues if enable parallel testing

## File path: hudi-utilities/src/main/java/org/apache/hudi/utilities/deltastreamer/BootstrapExecutor.java

```diff
@@ -161,12 +160,16 @@ public void execute() throws IOException {
    */
   private void syncHive() {
     if (cfg.enableHiveSync || cfg.enableMetaSync) {
-      HiveSyncConfig hiveSyncConfig = DataSourceUtils.buildHiveSyncConfig(props, cfg.targetBasePath, cfg.baseFileFormat);
-      HiveConf hiveConf = new HiveConf(fs.getConf(), HiveConf.cla
```
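A minimal sketch of the builder pattern the reviewer is asking for; `HiveSyncConfigBuilder` and its methods are hypothetical rather than actual Hudi API, and only the property keys are real Hudi hive-sync keys:

```java
import org.apache.hudi.common.config.TypedProperties;

// Illustrative only: fluent, validated construction instead of handing the
// config an unchecked bag of raw properties.
public final class HiveSyncConfigBuilder {
  private final TypedProperties props = new TypedProperties();

  public HiveSyncConfigBuilder jdbcUrl(String url) {
    props.setProperty("hoodie.datasource.hive_sync.jdbcurl", url);
    return this;
  }

  public HiveSyncConfigBuilder database(String db) {
    props.setProperty("hoodie.datasource.hive_sync.database", db);
    return this;
  }

  public HiveSyncConfigBuilder table(String table) {
    props.setProperty("hoodie.datasource.hive_sync.table", table);
    return this;
  }

  public TypedProperties build() {
    // Validation lives here, so every constructed config is well-formed.
    if (!props.containsKey("hoodie.datasource.hive_sync.jdbcurl")) {
      throw new IllegalStateException("jdbcUrl is required");
    }
    return props;
  }
}
```

The payoff of the pattern is that required keys are checked once at construction time instead of failing later inside the sync tool.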
[GitHub] [hudi] hudi-bot removed a comment on pull request #5013: [HUDI-3593] Restore TypedProperties and flush checksum in table config
hudi-bot removed a comment on pull request #5013:
URL: https://github.com/apache/hudi/pull/5013#issuecomment-1064845719

## CI report:

* a2e2b2ecd3ffe2974fac5e6472c2ab273f4d13c4 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6800)
* f50fc2686b0c3b7f17c741ca99db9629aafc6b66 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6832)
[GitHub] [hudi] hudi-bot commented on pull request #5013: [HUDI-3593] Restore TypedProperties and flush checksum in table config
hudi-bot commented on pull request #5013:
URL: https://github.com/apache/hudi/pull/5013#issuecomment-1065024087

## CI report:

* f50fc2686b0c3b7f17c741ca99db9629aafc6b66 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6832)
[GitHub] [hudi] huberylee commented on pull request #4982: [HUDI-3567] Refactor HoodieCommonUtils to make code more reasonable
huberylee commented on pull request #4982:
URL: https://github.com/apache/hudi/pull/4982#issuecomment-1065024812

> @huberylee thanks for cleaning things up!

@alexeykudinkin All comments have been addressed.
[GitHub] [hudi] boneanxs commented on a change in pull request #5013: [HUDI-3593] Restore TypedProperties and flush checksum in table config
boneanxs commented on a change in pull request #5013:
URL: https://github.com/apache/hudi/pull/5013#discussion_r824633524

## File path: hudi-common/src/main/java/org/apache/hudi/common/config/OrderedProperties.java

```diff
@@ -0,0 +1,150 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *      http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.hudi.common.config;
+
+import java.util.Collections;
+import java.util.Enumeration;
+import java.util.HashSet;
+import java.util.LinkedHashSet;
+import java.util.Map;
+import java.util.Objects;
+import java.util.Properties;
+import java.util.Set;
+
+/**
+ * An extension of {@link java.util.Properties} that maintains the order.
```

Review comment: nit: Can note this property is not threadsafe
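For illustration, a minimal sketch of an insertion-ordered `Properties` along the lines of the class under review, built on the same `LinkedHashSet` visible in the imports above; this is an assumed implementation, and as the reviewer notes, the key bookkeeping is not safe under concurrent compound operations:

```java
import java.util.Collections;
import java.util.Enumeration;
import java.util.LinkedHashSet;
import java.util.Properties;
import java.util.Set;

public class OrderedPropertiesSketch extends Properties {
  // Tracks keys in insertion order; the Hashtable backing Properties does not.
  private final Set<Object> orderedKeys = new LinkedHashSet<>();

  @Override
  public synchronized Object put(Object key, Object value) {
    orderedKeys.add(key);
    return super.put(key, value);
  }

  @Override
  public synchronized Object remove(Object key) {
    orderedKeys.remove(key);
    return super.remove(key);
  }

  @Override
  public Enumeration<Object> keys() {
    // Iterate in insertion order instead of Hashtable's bucket order.
    return Collections.enumeration(orderedKeys);
  }
}
```

Preserving insertion order matters when the properties are flushed back to a file such as `hoodie.properties`, so that rewrites do not reshuffle the entries.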
[jira] [Commented] (HUDI-3607) Support backend switch in HoodieFlinkStreamer
[ https://issues.apache.org/jira/browse/HUDI-3607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17504872#comment-17504872 ] Xianghu Wang commented on HUDI-3607: hi [~liufangqi] The reason you can't assign it to yourself is that you don't have the contribution permission yet, please apply for it. You can refer to this: [https://mp.weixin.qq.com/s?__biz=MzIyMzQ0NjA0MQ==&mid=2247484782&idx=3&sn=e1c121710c14680e0bf69349eb68424c&chksm=e81f5018df68d90e81b98beb10ff47bc0217774799126abb5ac4a04e25ebb26cf0342aef7016&token=1688466117&lang=zh_CN#rd] By the way, you can still push your patch first > Support backend switch in HoodieFlinkStreamer > - > > Key: HUDI-3607 > URL: https://issues.apache.org/jira/browse/HUDI-3607 > Project: Apache Hudi > Issue Type: Improvement > Components: flink >Reporter: 刘方奇 >Priority: Major > > Now, the HoodieFlinkStreamer utility only supports one backend - FsStateBackend. > I think this is not flexible for application configuration. Could we make > the backend configurable? > Moreover, as of Flink 1.14, FsStateBackend is deprecated in favor of > org.apache.flink.runtime.state.hashmap.HashMapStateBackend and > org.apache.flink.runtime.state.storage.FileSystemCheckpointStorage. -- This message was sent by Atlassian Jira (v8.20.1#820001)
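For reference, a minimal sketch of the Flink 1.14-style configuration the description points to; the checkpoint path and interval are placeholders, not HoodieFlinkStreamer defaults:

{code:java}
import org.apache.flink.runtime.state.hashmap.HashMapStateBackend;
import org.apache.flink.runtime.state.storage.FileSystemCheckpointStorage;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class BackendSwitchSketch {
  public static void main(String[] args) {
    StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
    env.enableCheckpointing(60_000L); // placeholder interval

    // Keyed/operator state lives on the JVM heap...
    env.setStateBackend(new HashMapStateBackend());
    // ...while checkpoints go to a durable filesystem path; together these
    // two pieces replace the deprecated FsStateBackend.
    env.getCheckpointConfig().setCheckpointStorage(
        new FileSystemCheckpointStorage("hdfs:///flink/checkpoints"));
  }
}
{code}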
[jira] [Comment Edited] (HUDI-3607) Support backend switch in HoodieFlinkStreamer
[ https://issues.apache.org/jira/browse/HUDI-3607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17504872#comment-17504872 ] Xianghu Wang edited comment on HUDI-3607 at 3/11/22, 11:39 AM: --- hi [~liufangqi] The reason you can't assign it to yourself is that you don't have the contribution permission yet, please apply for it. You can refer to this: [https://mp.weixin.qq.com/s?__biz=MzIyMzQ0NjA0MQ==&mid=2247484782&idx=3&sn=e1c121710c14680e0bf69349eb68424c&chksm=e81f5018df68d90e81b98beb10ff47bc0217774799126abb5ac4a04e25ebb26cf0342aef7016&token=1688466117&lang=zh_CN#rd] By the way, you can push your patch first was (Author: wangxinghu): hi [~liufangqi] The reason you can't assign it to yourself is that you don't have the contribution permission yet, please apply for it. You can refer to this: [https://mp.weixin.qq.com/s?__biz=MzIyMzQ0NjA0MQ==&mid=2247484782&idx=3&sn=e1c121710c14680e0bf69349eb68424c&chksm=e81f5018df68d90e81b98beb10ff47bc0217774799126abb5ac4a04e25ebb26cf0342aef7016&token=1688466117&lang=zh_CN#rd] By the way, you can still push your patch first > Support backend switch in HoodieFlinkStreamer > - > > Key: HUDI-3607 > URL: https://issues.apache.org/jira/browse/HUDI-3607 > Project: Apache Hudi > Issue Type: Improvement > Components: flink >Reporter: 刘方奇 >Priority: Major > > Now, the HoodieFlinkStreamer utility only supports one backend - FsStateBackend. > I think this is not flexible for application configuration. Could we make > the backend configurable? > Moreover, as of Flink 1.14, FsStateBackend is deprecated in favor of > org.apache.flink.runtime.state.hashmap.HashMapStateBackend and > org.apache.flink.runtime.state.storage.FileSystemCheckpointStorage. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[GitHub] [hudi] codope commented on a change in pull request #4640: [HUDI-3225] [RFC-45] for async metadata indexing
codope commented on a change in pull request #4640: URL: https://github.com/apache/hudi/pull/4640#discussion_r824640318 ## File path: rfc/rfc-45/rfc-45.md ## @@ -0,0 +1,264 @@ + + +# RFC-45: Asynchronous Metadata Indexing + +## Proposers + +- @codope +- @manojpec + +## Approvers + +- @nsivabalan +- @vinothchandar + +## Status + +JIRA: [HUDI-2488](https://issues.apache.org/jira/browse/HUDI-2488) + +## Abstract + +Metadata indexing (aka metadata bootstrapping) is the process of creation of one +or more metadata-based indexes, e.g. data partitions to files index, that is +stored in Hudi metadata table. Currently, the metadata table (referred as MDT +hereafter) supports single partition which is created synchronously with the +corresponding data table, i.e. commits are first applied to metadata table +followed by data table. Our goal for MDT is to support multiple partitions to +boost the performance of existing index and records lookup. However, the +synchronous manner of metadata indexing is not very scalable as we add more +partitions to the MDT because the regular writers (writing to the data table) +have to wait until the MDT commit completes. In this RFC, we propose a design to +support asynchronous metadata indexing. + +## Background + +We can read more about the MDT design +in [RFC-15](https://cwiki.apache.org/confluence/display/HUDI/RFC+-+15%3A+HUDI+File+Listing+Improvements) +. Here is a quick summary of the current state (Hudi v0.10.1). MDT is an +internal Merge-on-Read (MOR) table that has a single partition called `files` +which stores the data partitions to files index that is used in file listing. +MDT is co-located with the data table (inside `.hoodie/metadata` directory under +the basepath). In order to handle multi-writer scenario, users configure lock +provider and only one writer can access MDT in read-write mode. Hence, any write +to MDT is guarded by the data table lock. This ensures only one write is +committed to MDT at any point in time and thus guarantees serializability. +However, locking overhead adversely affects the write throughput and will reach +its scalability limits as we add more partitions to the MDT. + +## Goals + +- Support indexing one or more partitions in MDT while regular writers and table + services (such as cleaning or compaction) are in progress. +- Locking to be as lightweight as possible. +- Keep required config changes to a minimum to simplify deployment / upgrade in + production. +- Do not require specific ordering of how writers and table service pipelines + need to be upgraded / restarted. +- If an external long-running process is being used to initialize the index, the + process should be made idempotent so it can handle errors from previous runs. +- To re-initialize the index, make it as simple as running the external + initialization process again without having to change configs. + +## Implementation + +### A new Hudi action: INDEX + +We introduce a new action `index` which will denote the index building process, +the mechanics of which is as follows: + +1. From an external process, users can issue a CREATE INDEX or similar statement + to trigger indexing for an existing table. +1. This will schedule INDEX action and add + a `.index.requested` to the timeline, which contains the + indexing plan. Index scheduling will also initialize the filegroup for + the partitions for which indexing is planned. +2. 
From here on, the index building process will continue to build an index + up to instant time `t`, where `t` is the latest completed instant time on + the timeline without any + "holes" i.e. no pending async operations prior to it. +3. The indexing process will write these out as base files within the + corresponding metadata partition. A metadata partition cannot be used if + there is any pending indexing action against it. As and when indexing is + completed for a partition, then table config (`hoodie.properties`) will + be updated to indicate that partition is available for reads or + synchronous updates. Hudi table config will be the source of truth for + the current state of metadata index. + +2. Any inflight writers (i.e. with instant time `t'` > `t`) will check for any + new indexing request on the timeline prior to preparing to commit. +1. Such writers will proceed to additionally add log entries corresponding + to each such indexing request into the metadata partition. +2. There is always a TOCTOU issue here, where the inflight writer may not + see an indexing request that was just added and proceed to commit without + that. We will correct this during indexing action completion. In the + average case, this may not happen and the design has liveness. + +3. When the indexing process is about to complete (i.e. indexing upto + instant `t` is done but before completing indexing commit), it will ch
[GitHub] [hudi] vinothchandar commented on pull request #4012: [HUDI-2777] Data import performance deteriorates because multiple Spark jobs are started when data is written to disks.
vinothchandar commented on pull request #4012: URL: https://github.com/apache/hudi/pull/4012#issuecomment-1065043222 @xushiyan @nsivabalan can we get to the bottom of this issue and summarize what we find? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] codope commented on a change in pull request #4640: [HUDI-3225] [RFC-45] for async metadata indexing
codope commented on a change in pull request #4640: URL: https://github.com/apache/hudi/pull/4640#discussion_r824647337 ## File path: rfc/rfc-45/rfc-45.md ## @@ -0,0 +1,229 @@ + + +# RFC-45: Asynchronous Metadata Indexing + +## Proposers + +- @codope +- @manojpec + +## Approvers + +- @nsivabalan +- @vinothchandar + +## Status + +JIRA: [HUDI-2488](https://issues.apache.org/jira/browse/HUDI-2488) + +## Abstract + +Metadata indexing (aka metadata bootstrapping) is the process of creation of one +or more metadata-based indexes, e.g. data partitions to files index, that is +stored in Hudi metadata table. Currently, the metadata table (referred as MDT +hereafter) supports single partition which is created synchronously with the +corresponding data table, i.e. commits are first applied to metadata table +followed by data table. Our goal for MDT is to support multiple partitions to +boost the performance of existing index and records lookup. However, the +synchronous manner of metadata indexing is not very scalable as we add more +partitions to the MDT because the regular writers (writing to the data table) +have to wait until the MDT commit completes. In this RFC, we propose a design to +support asynchronous metadata indexing. + +## Background + +We can read more about the MDT design +in [RFC-15](https://cwiki.apache.org/confluence/display/HUDI/RFC+-+15%3A+HUDI+File+Listing+Improvements) +. Here is a quick summary of the current state (Hudi v0.10.1). MDT is an +internal Merge-on-Read (MOR) table that has a single partition called `files` +which stores the data partitions to files index that is used in file listing. +MDT is co-located with the data table (inside `.hoodie/metadata` directory under +the basepath). In order to handle multi-writer scenario, users configure lock +provider and only one writer can access MDT in read-write mode. Hence, any write +to MDT is guarded by the data table lock. This ensures only one write is +committed to MDT at any point in time and thus guarantees serializability. +However, locking overhead adversely affects the write throughput and will reach +its scalability limits as we add more partitions to the MDT. + +## Goals + +- Support indexing one or more partitions in MDT while regular writers and table + services (such as cleaning or compaction) are in progress. +- Locking to be as lightweight as possible. +- Keep required config changes to a minimum to simplify deployment / upgrade in + production. +- Do not require specific ordering of how writers and table service pipelines + need to be upgraded / restarted. +- If an external long-running process is being used to initialize the index, the + process should be made idempotent so it can handle errors from previous runs. +- To re-initialize the index, make it as simple as running the external + initialization process again without having to change configs. + +## Implementation + +### A new Hudi action: INDEX + +We introduce a new action `index` which will denote the index building process, +the mechanics of which is as follows: + +1. From an external process, users can issue a CREATE INDEX or similar statement + to trigger indexing for an existing table. +1. This will add a `.index.requested` to the timeline, which + contains the indexing plan. +2. From here on, the index building process will continue to build an index Review comment: Yes we can do that and can avoid little serde cost. It can also ease debugging. 
However, I should point out that the index action will be written on the data timeline, as it will be known to the user. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
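To illustrate the writer-side step of the protocol this thread discusses (inflight writers detecting pending index requests before commit), here is a rough, self-contained sketch. Every type and method name in it is a hypothetical stand-in for the RFC's concepts, not a real Hudi API:

```java
import java.util.List;

/** Hypothetical stand-ins for the RFC's concepts, not Hudi types. */
interface IndexPlan { String metadataPartition(); }
interface DataTimeline { List<IndexPlan> pendingIndexPlans(); }

class InflightWriterSketch {
  private final DataTimeline timeline;

  InflightWriterSketch(DataTimeline timeline) {
    this.timeline = timeline;
  }

  /** Called by a writer at instant t' > t just before preparing to commit. */
  void beforeCommit(String inflightInstant) {
    // Re-scan the data timeline for index requests added before this commit.
    for (IndexPlan plan : timeline.pendingIndexPlans()) {
      // Append log entries for t' into each partition being indexed, so the
      // async indexer does not miss this commit.
      appendIndexLogEntries(plan.metadataPartition(), inflightInstant);
    }
    // TOCTOU caveat from the RFC: a request added after this scan is
    // reconciled when the INDEX action itself completes.
  }

  private void appendIndexLogEntries(String partition, String instant) {
    // placeholder for writing metadata log blocks
  }
}
```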
[GitHub] [hudi] vinothchandar commented on a change in pull request #4264: [HUDI-2875] Make HoodieParquetWriter Thread safe and memory executor …
vinothchandar commented on a change in pull request #4264: URL: https://github.com/apache/hudi/pull/4264#discussion_r824648413 ## File path: hudi-common/src/main/java/org/apache/hudi/common/util/queue/BoundedInMemoryExecutor.java ## @@ -47,7 +48,7 @@ public class BoundedInMemoryExecutor { private static final Logger LOG = LogManager.getLogger(BoundedInMemoryExecutor.class); - + private static final long TERMINATE_WAITING_TIME = 60L; Review comment: rename: TERMINATE_TIMEOUT_SECS ? ## File path: hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/table/action/bootstrap/ParquetBootstrapMetadataHandler.java ## @@ -80,14 +80,15 @@ void executeBootstrap(HoodieBootstrapHandle bootstrapHandle, HoodieRecord rec = new HoodieAvroRecord(new HoodieKey(recKey, partitionPath), payload); return rec; }, table.getPreExecuteRunnable()); - wrapper.execute(); + executor.execute(); } catch (Exception e) { throw new HoodieException(e); } finally { - bootstrapHandle.close(); - if (null != wrapper) { -wrapper.shutdownNow(); + reader.close(); + if (null != executor) { +executor.shutdownNow(); Review comment: any reason we don't awaitTermination here? ## File path: hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/table/action/commit/SparkMergeHelper.java ## @@ -77,7 +77,7 @@ public void runMerge(HoodieTable>, JavaRDD readSchema = mergeHandle.getWriterSchemaWithMetaFields(); } -BoundedInMemoryExecutor wrapper = null; +BoundedInMemoryExecutor executor = null; Review comment: In future, could we do these renames in a separate PR? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
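On the awaitTermination question: the usual java.util.concurrent pattern is to pair shutdownNow() with a bounded awaitTermination(), for example using the suggested TERMINATE_TIMEOUT_SECS constant. A generic sketch of that pattern, not the BoundedInMemoryExecutor code itself:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

class ShutdownSketch {
  private static final long TERMINATE_TIMEOUT_SECS = 60L;
  private final ExecutorService pool = Executors.newFixedThreadPool(2);

  void shutdownNowAndWait() {
    // Interrupt running producer/consumer tasks...
    pool.shutdownNow();
    try {
      // ...then block (bounded) until they actually finish, so the caller
      // does not close handles the tasks may still be using.
      if (!pool.awaitTermination(TERMINATE_TIMEOUT_SECS, TimeUnit.SECONDS)) {
        throw new IllegalStateException(
            "Executor did not terminate within " + TERMINATE_TIMEOUT_SECS + "s");
      }
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
    }
  }
}
```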
[GitHub] [hudi] vinothchandar commented on pull request #4880: [HUDI-2752] The MOR DELETE block breaks the event time sequence of CDC
vinothchandar commented on pull request #4880: URL: https://github.com/apache/hudi/pull/4880#issuecomment-1065050025 Will review again. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot removed a comment on pull request #4984: [HUDI-3583] Fix MarkerBasedRollbackStrategy NoSuchElementException
hudi-bot removed a comment on pull request #4984: URL: https://github.com/apache/hudi/pull/4984#issuecomment-1064867430 ## CI report: * 015f7f0e07d3f0efbd8d3a728f802fc5572a8f52 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6694) * e30a63cc90f3afbea7ee36c37283f2f21ea7998f UNKNOWN * c0a0e141561d1d75150aab046090e1ccd1c9e2c2 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6834) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot commented on pull request #4984: [HUDI-3583] Fix MarkerBasedRollbackStrategy NoSuchElementException
hudi-bot commented on pull request #4984: URL: https://github.com/apache/hudi/pull/4984#issuecomment-1065050643 ## CI report: * e30a63cc90f3afbea7ee36c37283f2f21ea7998f UNKNOWN * c0a0e141561d1d75150aab046090e1ccd1c9e2c2 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6834) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot removed a comment on pull request #4640: [HUDI-3225] [RFC-45] for async metadata indexing
hudi-bot removed a comment on pull request #4640: URL: https://github.com/apache/hudi/pull/4640#issuecomment-1063194983 ## CI report: * 7afccec9740814a4bf7f8a3d8a6125c223829d27 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6750) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot commented on pull request #4640: [HUDI-3225] [RFC-45] for async metadata indexing
hudi-bot commented on pull request #4640: URL: https://github.com/apache/hudi/pull/4640#issuecomment-1065054306 ## CI report: * 7afccec9740814a4bf7f8a3d8a6125c223829d27 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6750) * 62db921e601f9c81c8bd9bb53df771aec6e2de6e UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] codope commented on a change in pull request #4640: [HUDI-3225] [RFC-45] for async metadata indexing
codope commented on a change in pull request #4640: URL: https://github.com/apache/hudi/pull/4640#discussion_r824656295 ## File path: rfc/rfc-45/rfc-45.md ## @@ -0,0 +1,229 @@ + + +# RFC-45: Asynchronous Metadata Indexing + +## Proposers + +- @codope +- @manojpec + +## Approvers + +- @nsivabalan +- @vinothchandar + +## Status + +JIRA: [HUDI-2488](https://issues.apache.org/jira/browse/HUDI-2488) + +## Abstract + +Metadata indexing (aka metadata bootstrapping) is the process of creation of one +or more metadata-based indexes, e.g. data partitions to files index, that is +stored in Hudi metadata table. Currently, the metadata table (referred as MDT +hereafter) supports single partition which is created synchronously with the +corresponding data table, i.e. commits are first applied to metadata table +followed by data table. Our goal for MDT is to support multiple partitions to +boost the performance of existing index and records lookup. However, the +synchronous manner of metadata indexing is not very scalable as we add more +partitions to the MDT because the regular writers (writing to the data table) +have to wait until the MDT commit completes. In this RFC, we propose a design to +support asynchronous metadata indexing. + +## Background + +We can read more about the MDT design +in [RFC-15](https://cwiki.apache.org/confluence/display/HUDI/RFC+-+15%3A+HUDI+File+Listing+Improvements) +. Here is a quick summary of the current state (Hudi v0.10.1). MDT is an +internal Merge-on-Read (MOR) table that has a single partition called `files` +which stores the data partitions to files index that is used in file listing. +MDT is co-located with the data table (inside `.hoodie/metadata` directory under +the basepath). In order to handle multi-writer scenario, users configure lock +provider and only one writer can access MDT in read-write mode. Hence, any write +to MDT is guarded by the data table lock. This ensures only one write is +committed to MDT at any point in time and thus guarantees serializability. +However, locking overhead adversely affects the write throughput and will reach +its scalability limits as we add more partitions to the MDT. + +## Goals + +- Support indexing one or more partitions in MDT while regular writers and table + services (such as cleaning or compaction) are in progress. +- Locking to be as lightweight as possible. +- Keep required config changes to a minimum to simplify deployment / upgrade in + production. +- Do not require specific ordering of how writers and table service pipelines + need to be upgraded / restarted. +- If an external long-running process is being used to initialize the index, the + process should be made idempotent so it can handle errors from previous runs. +- To re-initialize the index, make it as simple as running the external + initialization process again without having to change configs. + +## Implementation + +### A new Hudi action: INDEX + +We introduce a new action `index` which will denote the index building process, +the mechanics of which is as follows: + +1. From an external process, users can issue a CREATE INDEX or similar statement + to trigger indexing for an existing table. +1. This will add a `.index.requested` to the timeline, which Review comment: i'd prefer `index` for brevity, and none of our action end with -ing. But let me know if you think `indexing` is more appropriate, i can change it. -- This is an automated message from the Apache Git Service. 
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot commented on pull request #4640: [HUDI-3225] [RFC-45] for async metadata indexing
hudi-bot commented on pull request #4640: URL: https://github.com/apache/hudi/pull/4640#issuecomment-1065056424 ## CI report: * 7afccec9740814a4bf7f8a3d8a6125c223829d27 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6750) * 62db921e601f9c81c8bd9bb53df771aec6e2de6e Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6839) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot removed a comment on pull request #4640: [HUDI-3225] [RFC-45] for async metadata indexing
hudi-bot removed a comment on pull request #4640: URL: https://github.com/apache/hudi/pull/4640#issuecomment-1065054306 ## CI report: * 7afccec9740814a4bf7f8a3d8a6125c223829d27 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6750) * 62db921e601f9c81c8bd9bb53df771aec6e2de6e UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[jira] [Updated] (HUDI-2871) Decouple metrics dependencies from hudi-client-common
[ https://issues.apache.org/jira/browse/HUDI-2871?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Raymond Xu updated HUDI-2871: - Reviewers: Rajesh Mahindra, sivabalan narayanan (was: Raymond Xu, sivabalan narayanan) > Decouple metrics dependencies from hudi-client-common > - > > Key: HUDI-2871 > URL: https://issues.apache.org/jira/browse/HUDI-2871 > Project: Apache Hudi > Issue Type: Improvement > Components: Code Cleanup, dependencies, metrics, writer-core >Reporter: Vinoth Chandar >Assignee: Rajesh >Priority: Blocker > Labels: pull-request-available > Fix For: 0.11.0 > > > Several metrics integrations - CloudWatch, Graphite, Prometheus, etc. - are all > pulled in. > It might be good to break these out into their own modules and include them during > packaging. This needs some way of reflection-based instantiation of the > metrics reporter. -- This message was sent by Atlassian Jira (v8.20.1#820001)
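A generic sketch of the reflection-based instantiation the description asks for; the interface and config key here are illustrative, not Hudi's actual abstractions. Loading the reporter by class name is what keeps the CloudWatch/Graphite/Prometheus jars optional at compile time:

{code:java}
import java.util.Properties;

/** Illustrative reporter abstraction; the real one may differ. */
interface MetricsReporter {
  void start();
}

final class MetricsReporterFactory {
  private MetricsReporterFactory() {}

  static MetricsReporter create(Properties props) {
    // Hypothetical config key naming the implementation class.
    String clazz = props.getProperty("metrics.reporter.class");
    try {
      // Reflection means hudi-client-common needs no compile-time dependency
      // on any concrete reporter module; the module's jar only has to be on
      // the classpath when that reporter is actually configured.
      return (MetricsReporter) Class.forName(clazz)
          .getDeclaredConstructor(Properties.class)
          .newInstance(props);
    } catch (ReflectiveOperationException e) {
      throw new IllegalArgumentException("Cannot instantiate reporter: " + clazz, e);
    }
  }
}
{code}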
[jira] [Updated] (HUDI-3435) Do not throw exception when instant to rollback does not exist in metadata table active timeline
[ https://issues.apache.org/jira/browse/HUDI-3435?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Raymond Xu updated HUDI-3435: - Sprint: (was: Hudi-Sprint-Mar-07) > Do not throw exception when instant to rollback does not exist in metadata > table active timeline > > > Key: HUDI-3435 > URL: https://issues.apache.org/jira/browse/HUDI-3435 > Project: Apache Hudi > Issue Type: Improvement > Components: core >Reporter: Danny Chen >Assignee: Danny Chen >Priority: Critical > Labels: pull-request-available > Fix For: 0.11.0 > > > See the stacktrace: > {code:xml} > Caused by: org.apache.hudi.exception.HoodieMetadataException: The instant > [20220214211929120__deltacommit__COMPLETED] required to sync rollback of > 20220214211929120 has been archived > at > org.apache.hudi.metadata.HoodieTableMetadataUtil.lambda$processRollbackMetadata$10(HoodieTableMetadataUtil.java:224) > at java.util.HashMap$Values.forEach(HashMap.java:982) > at > java.util.Collections$UnmodifiableCollection.forEach(Collections.java:1082) > at > org.apache.hudi.metadata.HoodieTableMetadataUtil.processRollbackMetadata(HoodieTableMetadataUtil.java:201) > at > org.apache.hudi.metadata.HoodieTableMetadataUtil.convertMetadataToRecords(HoodieTableMetadataUtil.java:178) > at > org.apache.hudi.metadata.HoodieBackedTableMetadataWriter.update(HoodieBackedTableMetadataWriter.java:653) > at > org.apache.hudi.table.action.BaseActionExecutor.lambda$writeTableMetadata$2(BaseActionExecutor.java:77) > at org.apache.hudi.common.util.Option.ifPresent(Option.java:96) > at > org.apache.hudi.table.action.BaseActionExecutor.writeTableMetadata(BaseActionExecutor.java:77) > at > org.apache.hudi.table.action.rollback.BaseRollbackActionExecutor.finishRollback(BaseRollbackActionExecutor.java:244) > at > org.apache.hudi.table.action.rollback.BaseRollbackActionExecutor.runRollback(BaseRollbackActionExecutor.java:122) > at > org.apache.hudi.table.action.rollback.BaseRollbackActionExecutor.execute(BaseRollbackActionExecutor.java:144) > at > org.apache.hudi.table.HoodieFlinkMergeOnReadTable.rollback(HoodieFlinkMergeOnReadTable.java:132) > at > org.apache.hudi.table.HoodieTable.rollbackInflightCompaction(HoodieTable.java:499) > at > org.apache.hudi.util.CompactionUtil.lambda$rollbackCompaction$1(CompactionUtil.java:163) > at > java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1384) > at > java.util.stream.ReferencePipeline$Head.forEach(ReferencePipeline.java:647) > at > org.apache.hudi.util.CompactionUtil.rollbackCompaction(CompactionUtil.java:161) > at > org.apache.hudi.sink.compact.CompactionPlanOperator.open(CompactionPlanOperator.java:73) > at > org.apache.flink.streaming.runtime.tasks.OperatorChain.initializeStateAndOpenOperators(OperatorChain.java:442) > at > org.apache.flink.streaming.runtime.tasks.StreamTask.restoreGates(StreamTask.java:582) > at > org.apache.flink.streaming.runtime.tasks.StreamTaskActionExecutor$1.call(StreamTaskActionExecutor.java:55) > at > org.apache.flink.streaming.runtime.tasks.StreamTask.executeRestore(StreamTask.java:562) > at > org.apache.flink.streaming.runtime.tasks.StreamTask.runWithCleanUpOnFail(StreamTask.java:647) > at > org.apache.flink.streaming.runtime.tasks.StreamTask.restore(StreamTask.java:537) > at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:759) > at org.apache.flink.runtime.taskmanager.Task.run(Task.java:566) > at java.lang.Thread.run(Thread.java:748) > {code} -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Updated] (HUDI-3435) Do not throw exception when instant to rollback does not exist in metadata table active timeline
[ https://issues.apache.org/jira/browse/HUDI-3435?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Raymond Xu updated HUDI-3435: - Sprint: Cont' improve - 2022/03/7 > Do not throw exception when instant to rollback does not exist in metadata > table active timeline > > > Key: HUDI-3435 > URL: https://issues.apache.org/jira/browse/HUDI-3435 > Project: Apache Hudi > Issue Type: Improvement > Components: core >Reporter: Danny Chen >Assignee: Danny Chen >Priority: Critical > Labels: pull-request-available > Fix For: 0.11.0 > > > See the stacktrace: > {code:xml} > Caused by: org.apache.hudi.exception.HoodieMetadataException: The instant > [20220214211929120__deltacommit__COMPLETED] required to sync rollback of > 20220214211929120 has been archived > at > org.apache.hudi.metadata.HoodieTableMetadataUtil.lambda$processRollbackMetadata$10(HoodieTableMetadataUtil.java:224) > at java.util.HashMap$Values.forEach(HashMap.java:982) > at > java.util.Collections$UnmodifiableCollection.forEach(Collections.java:1082) > at > org.apache.hudi.metadata.HoodieTableMetadataUtil.processRollbackMetadata(HoodieTableMetadataUtil.java:201) > at > org.apache.hudi.metadata.HoodieTableMetadataUtil.convertMetadataToRecords(HoodieTableMetadataUtil.java:178) > at > org.apache.hudi.metadata.HoodieBackedTableMetadataWriter.update(HoodieBackedTableMetadataWriter.java:653) > at > org.apache.hudi.table.action.BaseActionExecutor.lambda$writeTableMetadata$2(BaseActionExecutor.java:77) > at org.apache.hudi.common.util.Option.ifPresent(Option.java:96) > at > org.apache.hudi.table.action.BaseActionExecutor.writeTableMetadata(BaseActionExecutor.java:77) > at > org.apache.hudi.table.action.rollback.BaseRollbackActionExecutor.finishRollback(BaseRollbackActionExecutor.java:244) > at > org.apache.hudi.table.action.rollback.BaseRollbackActionExecutor.runRollback(BaseRollbackActionExecutor.java:122) > at > org.apache.hudi.table.action.rollback.BaseRollbackActionExecutor.execute(BaseRollbackActionExecutor.java:144) > at > org.apache.hudi.table.HoodieFlinkMergeOnReadTable.rollback(HoodieFlinkMergeOnReadTable.java:132) > at > org.apache.hudi.table.HoodieTable.rollbackInflightCompaction(HoodieTable.java:499) > at > org.apache.hudi.util.CompactionUtil.lambda$rollbackCompaction$1(CompactionUtil.java:163) > at > java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1384) > at > java.util.stream.ReferencePipeline$Head.forEach(ReferencePipeline.java:647) > at > org.apache.hudi.util.CompactionUtil.rollbackCompaction(CompactionUtil.java:161) > at > org.apache.hudi.sink.compact.CompactionPlanOperator.open(CompactionPlanOperator.java:73) > at > org.apache.flink.streaming.runtime.tasks.OperatorChain.initializeStateAndOpenOperators(OperatorChain.java:442) > at > org.apache.flink.streaming.runtime.tasks.StreamTask.restoreGates(StreamTask.java:582) > at > org.apache.flink.streaming.runtime.tasks.StreamTaskActionExecutor$1.call(StreamTaskActionExecutor.java:55) > at > org.apache.flink.streaming.runtime.tasks.StreamTask.executeRestore(StreamTask.java:562) > at > org.apache.flink.streaming.runtime.tasks.StreamTask.runWithCleanUpOnFail(StreamTask.java:647) > at > org.apache.flink.streaming.runtime.tasks.StreamTask.restore(StreamTask.java:537) > at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:759) > at org.apache.flink.runtime.taskmanager.Task.run(Task.java:566) > at java.lang.Thread.run(Thread.java:748) > {code} -- This message was sent by Atlassian Jira (v8.20.1#820001)
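The fix direction the issue title suggests can be sketched generically (hypothetical names, not the actual patch): when the instant to roll back is no longer in the metadata table's active timeline, log and skip instead of throwing:

{code:java}
import java.util.Set;

class RollbackSyncSketch {
  static void syncRollback(String instantToRollback, Set<String> activeInstants) {
    if (!activeInstants.contains(instantToRollback)) {
      // Previously: throw HoodieMetadataException("... has been archived").
      // Proposed: the instant was most likely archived already, so skip.
      System.out.println("Instant " + instantToRollback
          + " not found in metadata active timeline; skipping rollback sync");
      return;
    }
    // ... apply rollback metadata for the instant ...
  }
}
{code}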
[jira] [Updated] (HUDI-2777) Data import performance deteriorates because multiple Spark jobs are started when data is written to disks.
[ https://issues.apache.org/jira/browse/HUDI-2777?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Raymond Xu updated HUDI-2777: - Sprint: Cont' improve - 2022/03/7 > Data import performance deteriorates because multiple Spark jobs are started > when data is written to disks. > --- > > Key: HUDI-2777 > URL: https://issues.apache.org/jira/browse/HUDI-2777 > Project: Apache Hudi > Issue Type: Improvement > Components: spark >Affects Versions: 0.9.0 > Environment: hudi 0.9.0 > spark3.1.1 > hive3.1.1 > hadoop3.1.1 >Reporter: liuhe0702 >Assignee: liuhe0702 >Priority: Critical > Labels: hudi-on-call, pull-request-available, query-eng, sev:high > Fix For: 0.11.0 > > > If multiple partitions exist and the final result of RDD.isEmpty is true, > Spark starts multiple jobs in 5-fold increment mode. As a result, the > computing performance deteriorates. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Updated] (HUDI-2777) Data import performance deteriorates because multiple Spark jobs are started when data is written to disks.
[ https://issues.apache.org/jira/browse/HUDI-2777?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Raymond Xu updated HUDI-2777: - Labels: pull-request-available (was: pull-request-available query-eng) > Data import performance deteriorates because multiple Spark jobs are started > when data is written to disks. > --- > > Key: HUDI-2777 > URL: https://issues.apache.org/jira/browse/HUDI-2777 > Project: Apache Hudi > Issue Type: Improvement > Components: spark >Affects Versions: 0.9.0 > Environment: hudi 0.9.0 > spark3.1.1 > hive3.1.1 > hadoop3.1.1 >Reporter: liuhe0702 >Assignee: liuhe0702 >Priority: Critical > Labels: pull-request-available > Fix For: 0.11.0 > > > If multiple partitions exist and the final result of RDD.isEmpty is true, > Spark starts multiple jobs in 5-fold increment mode. As a result, the > computing performance deteriorates. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Updated] (HUDI-2777) Data import performance deteriorates because multiple Spark jobs are started when data is written to disks.
[ https://issues.apache.org/jira/browse/HUDI-2777?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Raymond Xu updated HUDI-2777: - Labels: performance pull-request-available (was: pull-request-available) > Data import performance deteriorates because multiple Spark jobs are started > when data is written to disks. > --- > > Key: HUDI-2777 > URL: https://issues.apache.org/jira/browse/HUDI-2777 > Project: Apache Hudi > Issue Type: Improvement > Components: spark >Affects Versions: 0.9.0 > Environment: hudi 0.9.0 > spark3.1.1 > hive3.1.1 > hadoop3.1.1 >Reporter: liuhe0702 >Assignee: liuhe0702 >Priority: Critical > Labels: performance, pull-request-available > Fix For: 0.11.0 > > > If multiple partitions exist and the final result of RDD.isEmpty is true, > Spark starts multiple jobs in 5-fold increment mode. As a result, the > computing performance deteriorates. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Updated] (HUDI-2777) Data import performance deteriorates because multiple Spark jobs are started when data is written to disks.
[ https://issues.apache.org/jira/browse/HUDI-2777?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Raymond Xu updated HUDI-2777: - Labels: pull-request-available query-eng (was: hudi-on-call pull-request-available query-eng sev:high) > Data import performance deteriorates because multiple Spark jobs are started > when data is written to disks. > --- > > Key: HUDI-2777 > URL: https://issues.apache.org/jira/browse/HUDI-2777 > Project: Apache Hudi > Issue Type: Improvement > Components: spark >Affects Versions: 0.9.0 > Environment: hudi 0.9.0 > spark3.1.1 > hive3.1.1 > hadoop3.1.1 >Reporter: liuhe0702 >Assignee: liuhe0702 >Priority: Critical > Labels: pull-request-available, query-eng > Fix For: 0.11.0 > > > If multiple partitions exist and the final result of RDD.isEmpty is true, > Spark starts multiple jobs in 5-fold increment mode. As a result, the > computing performance deteriorates. -- This message was sent by Atlassian Jira (v8.20.1#820001)
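Background on the behavior the description refers to: JavaRDD.isEmpty() is implemented via take(1), which launches successive Spark jobs over a growing slice of partitions until a row is found, so an empty RDD with many partitions pays for several jobs. A hedged sketch of one common mitigation, caching and counting once when a full pass is needed anyway (names are illustrative):

{code:java}
import org.apache.spark.api.java.JavaRDD;

class IsEmptyCheckSketch {
  static boolean hasData(JavaRDD<String> records) {
    // records.isEmpty() delegates to take(1): on an empty RDD this retries
    // with progressively larger partition scans, i.e. multiple jobs.
    // If the data will be fully materialized anyway (e.g. written out),
    // caching plus a single count() avoids the repeated job launches.
    records.cache();
    long n = records.count(); // exactly one job over all partitions
    if (n == 0) {
      records.unpersist(false);
      return false;
    }
    return true; // 'records' is now cached for the subsequent write
  }
}
{code}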
[GitHub] [hudi] hudi-bot removed a comment on pull request #4925: [HUDI-3103] Enable MultiTableDeltaStreamer to update a single sink table from multiple source tables
hudi-bot removed a comment on pull request #4925: URL: https://github.com/apache/hudi/pull/4925#issuecomment-1064883259 ## CI report: * 7119319af35fb23afa97e058cd2fbfaea18292a1 Azure: [CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6831) * 82acc7daea301fa4e373c1ff2570a60ca1da6ce3 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6835) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot commented on pull request #4925: [HUDI-3103] Enable MultiTableDeltaStreamer to update a single sink table from multiple source tables
hudi-bot commented on pull request #4925: URL: https://github.com/apache/hudi/pull/4925#issuecomment-1065073245 ## CI report: * 82acc7daea301fa4e373c1ff2570a60ca1da6ce3 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6835) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] maddy2u commented on issue #4990: [SUPPORT] PySpark Examples for DeltaStreamer using Kinesis
maddy2u commented on issue #4990: URL: https://github.com/apache/hudi/issues/4990#issuecomment-1065076167 Hi - Any update on this? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] maddy2u commented on issue #4202: [SUPPORT] Support Apache Spark 3.2
maddy2u commented on issue #4202: URL: https://github.com/apache/hudi/issues/4202#issuecomment-1065076555 It works. Thanks. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot commented on pull request #4489: [HUDI-3135] Fix Delete partitions with metadata table and fix show partitions in spark sql
hudi-bot commented on pull request #4489: URL: https://github.com/apache/hudi/pull/4489#issuecomment-1065080414 ## CI report: * 44942ace20195bb284b5ce7e792462865255f2e0 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6836) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot removed a comment on pull request #4489: [HUDI-3135] Fix Delete partitions with metadata table and fix show partitions in spark sql
hudi-bot removed a comment on pull request #4489: URL: https://github.com/apache/hudi/pull/4489#issuecomment-1064902635 ## CI report: * d17343318be38b5a9b0953004700aa72f4fed689 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6809) * 44942ace20195bb284b5ce7e792462865255f2e0 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6836) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] xushiyan commented on a change in pull request #4877: [HUDI-3457][Stacked on 4818] Refactored Spark DataSource Relations to avoid code duplication
xushiyan commented on a change in pull request #4877: URL: https://github.com/apache/hudi/pull/4877#discussion_r824682968 ## File path: hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/BaseFileOnlyRelation.scala ## @@ -0,0 +1,94 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.hudi + +import org.apache.hadoop.conf.Configuration +import org.apache.hadoop.fs.Path +import org.apache.hudi.HoodieBaseRelation.createBaseFileReader +import org.apache.hudi.common.table.HoodieTableMetaClient +import org.apache.spark.sql.SQLContext +import org.apache.spark.sql.catalyst.InternalRow +import org.apache.spark.sql.catalyst.expressions.Expression +import org.apache.spark.sql.execution.datasources._ +import org.apache.spark.sql.sources.{BaseRelation, Filter} +import org.apache.spark.sql.types.StructType + +/** + * [[BaseRelation]] implementation only reading Base files of Hudi tables, essentially supporting the following querying + * modes: + * + * For COW tables: Snapshot + * For MOR tables: Read-optimized + * + * + * NOTE: The reason this Relation is used in lieu of Spark's default [[HadoopFsRelation]] is primarily due to the + * fact that it injects the real partition's path as the value of the partition field, which Hudi ultimately persists + * as part of the record payload. In some cases, however, the partition path might not necessarily be equal to the + * verbatim value of the partition path field (when a custom [[KeyGenerator]] is used), therefore leading to incorrect + * partition field values being written. + */ +class BaseFileOnlyRelation(sqlContext: SQLContext, Review comment: If you rename first and commit, then make a second commit with the actual changes, git should detect it as a rename. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot commented on pull request #4987: [HUDI-3547] Introduce MaxwellSourcePostProcessor to extract data from Maxwell json string
hudi-bot commented on pull request #4987: URL: https://github.com/apache/hudi/pull/4987#issuecomment-1065096681 ## CI report: * 84d9028db0242b31b9fcee5cc27a361bd3c987ae UNKNOWN * 1077705483682eca8c063671fcacdf73740dacdb Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6753) Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6837) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot removed a comment on pull request #4987: [HUDI-3547] Introduce MaxwellSourcePostProcessor to extract data from Maxwell json string
hudi-bot removed a comment on pull request #4987: URL: https://github.com/apache/hudi/pull/4987#issuecomment-1064987781 ## CI report: * 84d9028db0242b31b9fcee5cc27a361bd3c987ae UNKNOWN * 1077705483682eca8c063671fcacdf73740dacdb Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6753) Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6837) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] danny0405 commented on issue #4943: [SUPPORT] NoClassDefFoundError: org/apache/hudi/org/apache/hadoop/hive/metastore/api/NoSuchObjectException
danny0405 commented on issue #4943: URL: https://github.com/apache/hudi/issues/4943#issuecomment-1065103698 Yeah, that is a solution. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot removed a comment on pull request #4693: [WIP][HUDI-3175][RFC-45] Implement async metadata indexing
hudi-bot removed a comment on pull request #4693: URL: https://github.com/apache/hudi/pull/4693#issuecomment-1033224161 ## CI report: * 7920cb15d99cd92ea2a3e6bd515249eb63040772 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5801) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot commented on pull request #4693: [WIP][HUDI-3175][RFC-45] Implement async metadata indexing
hudi-bot commented on pull request #4693: URL: https://github.com/apache/hudi/pull/4693#issuecomment-1065109086 ## CI report: * 7920cb15d99cd92ea2a3e6bd515249eb63040772 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5801) * e6e3e1612928fb0892d071ec4c3a26e31ce1ff76 UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[jira] [Commented] (HUDI-3058) SqlQueryEqualityPreCommitValidator errors with java.util.ConcurrentModificationException
[ https://issues.apache.org/jira/browse/HUDI-3058?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17504918#comment-17504918 ] zhangyingjie commented on HUDI-3058: When I run two spark jobs to write to the same hudi table, I encounter the same error: py4j.protocol.Py4JJavaError: An error occurred while calling o92.save. : org.apache.hudi.exception.HoodieWriteConflictException: java.util.ConcurrentModificationException: Cannot resolve conflicts for overlapping writes at org.apache.hudi.client.transaction.SimpleConcurrentFileWritesConflictResolutionStrategy.resolveConflict(SimpleConcurrentFileWritesConflictResolutionStrategy.java:102) at org.apache.hudi.client.utils.TransactionUtils.lambda$resolveWriteConflictIfAny$0(TransactionUtils.java:73) at java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1384) at java.util.stream.Streams$ConcatSpliterator.forEachRemaining(Streams.java:742) at java.util.stream.ReferencePipeline$Head.forEach(ReferencePipeline.java:580) at org.apache.hudi.client.utils.TransactionUtils.resolveWriteConflictIfAny(TransactionUtils.java:67) at org.apache.hudi.client.SparkRDDWriteClient.preCommit(SparkRDDWriteClient.java:502) at org.apache.hudi.client.AbstractHoodieWriteClient.commitStats(AbstractHoodieWriteClient.java:196) at org.apache.hudi.client.SparkRDDWriteClient.commit(SparkRDDWriteClient.java:125) at org.apache.hudi.HoodieSparkSqlWriter$.commitAndPerformPostOperations(HoodieSparkSqlWriter.scala:635) at org.apache.hudi.HoodieSparkSqlWriter$.write(HoodieSparkSqlWriter.scala:286) at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:164) at org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:46) at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70) at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68) at org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:90) at org.apache.spark.sql.execution.SparkPlan.$anonfun$execute$1(SparkPlan.scala:180) at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:218) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:215) at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:176) at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:127) at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:126) at org.apache.spark.sql.DataFrameWriter.$anonfun$runCommand$1(DataFrameWriter.scala:962) at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:100) at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:160) at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:87) at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:767) at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64) at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:962) at org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:414) at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:398) at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:287) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native 
Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244) at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357) at py4j.Gateway.invoke(Gateway.java:282) at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132) at py4j.commands.CallCommand.execute(CallCommand.java:79) at py4j.GatewayConnection.run(GatewayConnection.java:238) at java.lang.Thread.run(Thread.java:748) Caused by: java.util.ConcurrentModificationException: Cannot resolve conflicts for overlapping writes ... 44 more > SqlQueryEqualityPreCommitValidator errors with > java.util.ConcurrentModificationException > > > Key: HUDI-3058 > URL: https:/
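For context, this exception is Hudi's optimistic concurrency control rejecting one of two writers whose commits touch overlapping file groups; the losing job is expected to retry. A sketch of the usual multi-writer write options (the ZooKeeper endpoint, table name, and paths are placeholders):

{code:java}
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SaveMode;

class MultiWriterSketch {
  static void write(Dataset<Row> df, String basePath) {
    df.write().format("hudi")
        .option("hoodie.table.name", "demo_tbl") // placeholder
        // Conflicting commits are detected at write completion; the losing
        // writer fails with ConcurrentModificationException and must retry.
        .option("hoodie.write.concurrency.mode", "optimistic_concurrency_control")
        .option("hoodie.cleaner.policy.failed.writes", "LAZY")
        .option("hoodie.write.lock.provider",
            "org.apache.hudi.client.transaction.lock.ZookeeperBasedLockProvider")
        .option("hoodie.write.lock.zookeeper.url", "zk-host")       // placeholder
        .option("hoodie.write.lock.zookeeper.port", "2181")
        .option("hoodie.write.lock.zookeeper.lock_key", "demo_tbl")
        .option("hoodie.write.lock.zookeeper.base_path", "/hudi/locks")
        .mode(SaveMode.Append)
        .save(basePath);
  }
}
{code}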
[GitHub] [hudi] hudi-bot removed a comment on pull request #4693: [WIP][HUDI-3175][RFC-45] Implement async metadata indexing
hudi-bot removed a comment on pull request #4693: URL: https://github.com/apache/hudi/pull/4693#issuecomment-1065109086 ## CI report: * 7920cb15d99cd92ea2a3e6bd515249eb63040772 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5801) * e6e3e1612928fb0892d071ec4c3a26e31ce1ff76 UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot commented on pull request #4693: [WIP][HUDI-3175][RFC-45] Implement async metadata indexing
hudi-bot commented on pull request #4693: URL: https://github.com/apache/hudi/pull/4693#issuecomment-1065111216 ## CI report: * 7920cb15d99cd92ea2a3e6bd515249eb63040772 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5801) * e6e3e1612928fb0892d071ec4c3a26e31ce1ff76 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6840) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] codope commented on pull request #4693: [WIP][HUDI-3175][RFC-45] Implement async metadata indexing
codope commented on pull request #4693: URL: https://github.com/apache/hudi/pull/4693#issuecomment-1065111787 > Good progress on this one. Getting close to being complete. Thanks @prashantwason for reviewing. I'll address your comments soon. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[jira] [Commented] (HUDI-3593) AsyncClustering failed because of ConcurrentModificationException
[ https://issues.apache.org/jira/browse/HUDI-3593?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17504921#comment-17504921 ] Sagar Sumit commented on HUDI-3593: --- [~shibei] [~Bone An] Could you try out [https://github.com/apache/hudi/pull/5013]? I have not seen ConcurrentModificationException with this patch in long-running clustering tests. It simply abstracts the ordering concern in TypedProperties (the thread-unsafe LinkedHashSet) into a separate class, which is used only in HoodieTableConfig just before flushing the props to the filesystem. I've reverted TypedProperties to use the native thread-safe put APIs.

> AsyncClustering failed because of ConcurrentModificationException
> -
>
> Key: HUDI-3593
> URL: https://issues.apache.org/jira/browse/HUDI-3593
> Project: Apache Hudi
> Issue Type: Bug
> Reporter: Hui An
> Assignee: Hui An
> Priority: Major
> Labels: pull-request-available
> Attachments: Screen Shot 2022-03-10 at 9.53.13 AM.png
>
>
> Following is the stacktrace I met,
> {code:java}
> ERROR AsyncClusteringService: Clustering executor failed
> java.util.concurrent.CompletionException: org.apache.spark.SparkException: Task not serializable
> at java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:273)
> at java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:280)
> at java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1606)
> at java.util.concurrent.CompletableFuture$AsyncSupply.exec(CompletableFuture.java:1596)
> at java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:289)
> at java.util.concurrent.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1056)
> at java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1692)
> at java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:157)
> Caused by: org.apache.spark.SparkException: Task not serializable
> at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:416)
> at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:406)
> at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:162)
> at org.apache.spark.SparkContext.clean(SparkContext.scala:2467)
> at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsWithIndex$1(RDD.scala:912)
> at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
> at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
> at org.apache.spark.rdd.RDD.withScope(RDD.scala:414)
> at org.apache.spark.rdd.RDD.mapPartitionsWithIndex(RDD.scala:911)
> at org.apache.spark.api.java.JavaRDDLike.mapPartitionsWithIndex(JavaRDDLike.scala:103)
> at org.apache.spark.api.java.JavaRDDLike.mapPartitionsWithIndex$(JavaRDDLike.scala:99)
> at org.apache.spark.api.java.AbstractJavaRDDLike.mapPartitionsWithIndex(JavaRDDLike.scala:45)
> at org.apache.hudi.table.action.commit.SparkBulkInsertHelper.bulkInsert(SparkBulkInsertHelper.java:115)
> at org.apache.hudi.client.clustering.run.strategy.SparkSortAndSizeExecutionStrategy.performClusteringWithRecordsRDD(SparkSortAndSizeExecutionStrategy.java:68)
> at org.apache.hudi.client.clustering.run.strategy.MultipleSparkJobExecutionStrategy.lambda$runClusteringForGroupAsync$4(MultipleSparkJobExecutionStrategy.java:175)
> at java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1604)
> ... 5 more
> Caused by: java.util.ConcurrentModificationException
> at java.util.LinkedHashMap$LinkedHashIterator.nextNode(LinkedHashMap.java:719)
> at java.util.LinkedHashMap$LinkedKeyIterator.next(LinkedHashMap.java:742)
> at java.util.HashSet.writeObject(HashSet.java:287)
> at sun.reflect.GeneratedMethodAccessor54.invoke(Unknown Source)
> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> at java.io.ObjectStreamClass.invokeWriteObject(ObjectStreamClass.java:1140)
> at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1496)
> at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
> at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
> at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
> at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509)
> at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
> at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
> at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
> at java.io.ObjectOu
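To make the ordering idea above concrete, here is a hedged sketch (not the actual Hudi class) of a Properties subclass that tracks insertion order in a dedicated structure, so TypedProperties itself can keep the plain thread-safe Hashtable behavior. Note that relying on keys() for ordered store() output is JDK-version dependent (it holds on JDK 8).

{code:scala}
// Sketch only: mirrors the idea of moving ordering out of TypedProperties.
// The real class (OrderedProperties in hudi-common) also overrides
// stringPropertyNames() and more; this shows just the core mechanism.
import java.util.{Collections, Enumeration, LinkedHashSet, Properties}

class InsertionOrderedProperties extends Properties {
  // Insertion order lives here; the Hashtable behind Properties stays thread-safe.
  private val orderedKeys = new LinkedHashSet[AnyRef]()

  override def put(key: AnyRef, value: AnyRef): AnyRef = this.synchronized {
    orderedKeys.add(key)
    super.put(key, value)
  }

  override def remove(key: AnyRef): AnyRef = this.synchronized {
    orderedKeys.remove(key)
    super.remove(key)
  }

  // On JDK 8, Properties.store() iterates keys(), so overriding it yields
  // deterministic, insertion-ordered output in hoodie.properties-style files.
  override def keys(): Enumeration[AnyRef] = Collections.enumeration(orderedKeys)
}
{code}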
[jira] [Commented] (HUDI-3593) AsyncClustering failed because of ConcurrentModificationException
[ https://issues.apache.org/jira/browse/HUDI-3593?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17504923#comment-17504923 ] Sagar Sumit commented on HUDI-3593: --- Also, I think changing the bulk insert API is a significant refactoring, which we can take up as a follow-up.

> AsyncClustering failed because of ConcurrentModificationException
> -
>
> Key: HUDI-3593
> URL: https://issues.apache.org/jira/browse/HUDI-3593
> Project: Apache Hudi
> Issue Type: Bug
> Reporter: Hui An
> Assignee: Hui An
> Priority: Major
> Labels: pull-request-available
> Attachments: Screen Shot 2022-03-10 at 9.53.13 AM.png
[jira] [Commented] (HUDI-3607) Support backend switch in HoodieFlinkStreamer
[ https://issues.apache.org/jira/browse/HUDI-3607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17504933#comment-17504933 ] 刘方奇 commented on HUDI-3607: --- Thanks for your reply; I will try your link.

> Support backend switch in HoodieFlinkStreamer
> -
>
> Key: HUDI-3607
> URL: https://issues.apache.org/jira/browse/HUDI-3607
> Project: Apache Hudi
> Issue Type: Improvement
> Components: flink
> Reporter: 刘方奇
> Priority: Major
>
> Now, the HoodieFlinkStreamer utility only supports one backend, FsStateBackend.
> I think that is not flexible for application configuration. Could we make the backend configurable?
> Moreover, as of Flink 1.14, FsStateBackend is deprecated in favor of
> org.apache.flink.runtime.state.hashmap.HashMapStateBackend and
> org.apache.flink.runtime.state.storage.FileSystemCheckpointStorage.

-- This message was sent by Atlassian Jira (v8.20.1#820001)
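For illustration, a configurable switch on Flink 1.14 could look roughly like the sketch below. The `backend` parameter is a made-up option, not an existing HoodieFlinkStreamer flag, and the RocksDB arm assumes the flink-statebackend-rocksdb dependency is on the classpath.

{code:scala}
// Hedged sketch: choose the state backend from configuration instead of
// hard-coding FsStateBackend. Flink 1.14 splits the state backend from
// checkpoint storage, as the issue description notes.
import org.apache.flink.contrib.streaming.state.EmbeddedRocksDBStateBackend
import org.apache.flink.runtime.state.hashmap.HashMapStateBackend
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment

def configureStateBackend(env: StreamExecutionEnvironment,
                          backend: String,
                          checkpointDir: String): Unit = {
  backend.toLowerCase match {
    case "hashmap" => env.setStateBackend(new HashMapStateBackend())
    case "rocksdb" => env.setStateBackend(new EmbeddedRocksDBStateBackend())
    case other => throw new IllegalArgumentException(s"Unsupported backend: $other")
  }
  // FileSystemCheckpointStorage replaces the storage half of FsStateBackend.
  env.getCheckpointConfig.setCheckpointStorage(checkpointDir)
}
{code}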
[GitHub] [hudi] hudi-bot commented on pull request #4640: [HUDI-3225] [RFC-45] for async metadata indexing
hudi-bot commented on pull request #4640: URL: https://github.com/apache/hudi/pull/4640#issuecomment-1065151944 ## CI report: * 62db921e601f9c81c8bd9bb53df771aec6e2de6e Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6839) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot removed a comment on pull request #4640: [HUDI-3225] [RFC-45] for async metadata indexing
hudi-bot removed a comment on pull request #4640: URL: https://github.com/apache/hudi/pull/4640#issuecomment-1065056424 ## CI report: * 7afccec9740814a4bf7f8a3d8a6125c223829d27 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6750) * 62db921e601f9c81c8bd9bb53df771aec6e2de6e Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=6839) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] nsivabalan commented on a change in pull request #4640: [HUDI-3225] [RFC-45] for async metadata indexing
nsivabalan commented on a change in pull request #4640: URL: https://github.com/apache/hudi/pull/4640#discussion_r824747770 ## File path: rfc/rfc-45/rfc-45.md ## @@ -0,0 +1,281 @@ + + +# RFC-45: Asynchronous Metadata Indexing + +## Proposers + +- @codope +- @manojpec + +## Approvers + +- @nsivabalan +- @vinothchandar + +## Status + +JIRA: [HUDI-2488](https://issues.apache.org/jira/browse/HUDI-2488) + +## Abstract + +Metadata indexing (aka metadata bootstrapping) is the process of creation of one +or more metadata-based indexes, e.g. data partitions to files index, that is +stored in Hudi metadata table. Currently, the metadata table (referred as MDT +hereafter) supports single partition which is created synchronously with the +corresponding data table, i.e. commits are first applied to metadata table +followed by data table. Our goal for MDT is to support multiple partitions to +boost the performance of existing index and records lookup. However, the +synchronous manner of metadata indexing is not very scalable as we add more +partitions to the MDT because the regular writers (writing to the data table) +have to wait until the MDT commit completes. In this RFC, we propose a design to +support asynchronous metadata indexing. + +## Background + +We can read more about the MDT design +in [RFC-15](https://cwiki.apache.org/confluence/display/HUDI/RFC+-+15%3A+HUDI+File+Listing+Improvements) +. Here is a quick summary of the current state (Hudi v0.10.1). MDT is an +internal Merge-on-Read (MOR) table that has a single partition called `files` +which stores the data partitions to files index that is used in file listing. +MDT is co-located with the data table (inside `.hoodie/metadata` directory under +the basepath). In order to handle multi-writer scenario, users configure lock +provider and only one writer can access MDT in read-write mode. Hence, any write +to MDT is guarded by the data table lock. This ensures only one write is +committed to MDT at any point in time and thus guarantees serializability. +However, locking overhead adversely affects the write throughput and will reach +its scalability limits as we add more partitions to the MDT. + +## Goals + +- Support indexing one or more partitions in MDT while regular writers and table + services (such as cleaning or compaction) are in progress. +- Locking to be as lightweight as possible. +- Keep required config changes to a minimum to simplify deployment / upgrade in + production. +- Do not require specific ordering of how writers and table service pipelines + need to be upgraded / restarted. +- If an external long-running process is being used to initialize the index, the + process should be made idempotent so it can handle errors from previous runs. +- To re-initialize the index, make it as simple as running the external + initialization process again without having to change configs. + +## Implementation + +### A new Hudi action: INDEX + +We introduce a new action `index` which will denote the index building process, +the mechanics of which is as follows: + +1. From an external process, users can issue a CREATE INDEX or similar statement + to trigger indexing for an existing table. +1. This will schedule INDEX action and add + a `.index.requested` to the timeline, which contains the + indexing plan. Index scheduling will also initialize the filegroup for + the partitions for which indexing is planned. The creation of filegroups + will be done within a lock. +2. 
From here on, the index building process will continue to build an index + up to instant time `t`, where `t` is the latest completed instant time on + the timeline without any + "holes" i.e. no pending async operations prior to it. Review comment: Not necessarily async; it could be regular writes too. In the multi-writer case, there could be a failed commit waiting to be rolled back.
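The "no holes" rule being discussed can be pictured with a tiny sketch: the indexer may only catch up to the latest completed instant that has no pending instant (whether an async table service or, per the review comment, a regular write awaiting rollback) anywhere before it. The `Instant` type below is a toy stand-in, not Hudi's HoodieInstant.

```scala
// Toy illustration of "latest completed instant without holes".
case class Instant(ts: String, completed: Boolean)

def latestInstantWithoutHoles(timeline: Seq[Instant]): Option[String] =
  timeline.sortBy(_.ts)
    .takeWhile(_.completed) // any pending instant is a "hole" that blocks catch-up
    .lastOption
    .map(_.ts)

// Example: commit 002 is still pending, so indexing can only reach 001,
// even though 003 has already completed.
assert(latestInstantWithoutHoles(Seq(
  Instant("001", completed = true),
  Instant("002", completed = false),
  Instant("003", completed = true))) == Some("001"))
```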
[GitHub] [hudi] nsivabalan commented on a change in pull request #5013: [HUDI-3593] Restore TypedProperties and flush checksum in table config
nsivabalan commented on a change in pull request #5013: URL: https://github.com/apache/hudi/pull/5013#discussion_r824761454 ## File path: hudi-common/src/main/java/org/apache/hudi/common/table/HoodieTableConfig.java ## @@ -233,6 +235,23 @@ public HoodieTableConfig(FileSystem fs, String metaPath, String payloadClassName "hoodie.properties file seems invalid. Please check for left over `.updated` files if any, manually copy it to hoodie.properties and retry"); } + private static Properties getOrderedPropertiesWithTableChecksum(Properties props) { +LinkedHashMap propsMap = getOrderedPropertiesMap(props); +propsMap.put(TABLE_CHECKSUM.key(), String.valueOf(generateChecksum(props))); +Properties orderedProps = new Properties(); Review comment: sure, makes sense. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
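As context for the snippet above, the write path pairs the ordered properties with a checksum computed just before flushing, so a reader of hoodie.properties can detect a torn or partial write. A hedged sketch of that idea follows; the key name and the CRC32 choice are assumptions for illustration, not necessarily Hudi's exact scheme.

```scala
// Sketch: checksum-guarded flush of ordered table properties.
// Key name and checksum function are illustrative assumptions.
import java.util.zip.CRC32

def checksum(props: Seq[(String, String)]): Long = {
  val crc = new CRC32()
  props.foreach { case (k, v) => crc.update(s"$k=$v".getBytes("UTF-8")) }
  crc.getValue
}

// Writer side: append the checksum as the last property before flushing
// (ideally write to a temp file and rename, so readers never see a torn file).
def withChecksum(props: Seq[(String, String)]): Seq[(String, String)] =
  props :+ ("hoodie.table.checksum" -> checksum(props).toString)

// Reader side: recompute over everything except the checksum entry and compare.
def isValid(props: Seq[(String, String)]): Boolean =
  props.lastOption.exists { case (k, v) =>
    k == "hoodie.table.checksum" && v == checksum(props.dropRight(1)).toString
  }
```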
[GitHub] [hudi] maddy2u opened a new issue #5021: [FEATURE REQUEST] Support for Presto/Athena to make updates to Hudi Tables
maddy2u opened a new issue #5021: URL: https://github.com/apache/hudi/issues/5021 Hi, within the AWS stack, Athena already supports updates to Apache Iceberg tables, as seen here (https://docs.aws.amazon.com/athena/latest/ug/querying-iceberg.html). It would be great if we could do the same with Apache Hudi and keep it consistent with all the other workflows, such that an update made from Athena is reflected in the CoW or MoR table when reading from Spark or Glue etc. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] nsivabalan commented on a change in pull request #5013: [HUDI-3593] Restore TypedProperties and flush checksum in table config
nsivabalan commented on a change in pull request #5013: URL: https://github.com/apache/hudi/pull/5013#discussion_r824766705 ## File path: hudi-common/src/main/java/org/apache/hudi/common/config/OrderedProperties.java ## @@ -0,0 +1,150 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. + */ + +package org.apache.hudi.common.config; + +import java.util.Collections; +import java.util.Enumeration; +import java.util.HashSet; +import java.util.LinkedHashSet; +import java.util.Map; +import java.util.Objects; +import java.util.Properties; +import java.util.Set; + +/** + * An extension of {@link java.util.Properties} that maintains the order. Review comment: +1 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[jira] [Created] (HUDI-3609) Create scala version specific artifacts for hudi-spark-client
sivabalan narayanan created HUDI-3609:
-
Summary: Create scala version specific artifacts for hudi-spark-client
Key: HUDI-3609
URL: https://issues.apache.org/jira/browse/HUDI-3609
Project: Apache Hudi
Issue Type: Bug
Reporter: sivabalan narayanan

Create scala version specific artifacts for hudi-spark-client. As of now, we generate just one artifact irrespective of the Scala or Spark version.

-- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Updated] (HUDI-3609) Create scala version specific artifacts for hudi-spark-client
[ https://issues.apache.org/jira/browse/HUDI-3609?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] sivabalan narayanan updated HUDI-3609: -- Fix Version/s: 0.11.0

> Create scala version specific artifacts for hudi-spark-client
> -
>
> Key: HUDI-3609
> URL: https://issues.apache.org/jira/browse/HUDI-3609
> Project: Apache Hudi
> Issue Type: Bug
> Reporter: sivabalan narayanan
> Priority: Major
> Fix For: 0.11.0
>
>
> Create scala version specific artifacts for hudi-spark-client.
>
> As of now, we generate just one artifact irrespective of the Scala or Spark version.

-- This message was sent by Atlassian Jira (v8.20.1#820001)
[GitHub] [hudi] xushiyan commented on a change in pull request #4877: [HUDI-3457][Stacked on 4818] Refactored Spark DataSource Relations to avoid code duplication
xushiyan commented on a change in pull request #4877: URL: https://github.com/apache/hudi/pull/4877#discussion_r824767545 ## File path: hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieDataSourceHelper.scala ## @@ -65,20 +65,6 @@ object HoodieDataSourceHelper extends PredicateHelper { } } - /** - * Extract the required schema from [[InternalRow]] - */ - def extractRequiredSchema( Review comment: this not used? ## File path: hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieBaseRelation.scala ## @@ -130,22 +158,110 @@ abstract class HoodieBaseRelation(val sqlContext: SQLContext, * NOTE: DO NOT OVERRIDE THIS METHOD */ override final def buildScan(requiredColumns: Array[String], filters: Array[Filter]): RDD[Row] = { +// NOTE: In case list of requested columns doesn't contain the Primary Key one, we +// have to add it explicitly so that +// - Merging could be performed correctly +// - In case 0 columns are to be fetched (for ex, when doing {@code count()} on Spark's [[Dataset]], +// Spark still fetches all the rows to execute the query correctly +// +// It's okay to return columns that have not been requested by the caller, as those nevertheless will be +// filtered out upstream +val fetchedColumns: Array[String] = appendMandatoryColumns(requiredColumns) + +val (requiredAvroSchema, requiredStructSchema) = + HoodieSparkUtils.getRequiredSchema(tableAvroSchema, fetchedColumns) + +val filterExpressions = convertToExpressions(filters) +val (partitionFilters, dataFilters) = filterExpressions.partition(isPartitionPredicate) + +val fileSplits = collectFileSplits(partitionFilters, dataFilters) + +val partitionSchema = StructType(Nil) +val tableSchema = HoodieTableSchema(tableStructSchema, tableAvroSchema.toString) +val requiredSchema = HoodieTableSchema(requiredStructSchema, requiredAvroSchema.toString) + // Here we rely on a type erasure, to workaround inherited API restriction and pass [[RDD[InternalRow]]] back as [[RDD[Row]]] // Please check [[needConversion]] scala-doc for more details -doBuildScan(requiredColumns, filters).asInstanceOf[RDD[Row]] +composeRDD(fileSplits, partitionSchema, tableSchema, requiredSchema, filters).asInstanceOf[RDD[Row]] } - protected def doBuildScan(requiredColumns: Array[String], filters: Array[Filter]): RDD[InternalRow] + // TODO scala-doc + protected def composeRDD(fileSplits: Seq[FileSplit], + partitionSchema: StructType, + tableSchema: HoodieTableSchema, + requiredSchema: HoodieTableSchema, + filters: Array[Filter]): HoodieUnsafeRDD + + // TODO scala-doc + protected def collectFileSplits(partitionFilters: Seq[Expression], dataFilters: Seq[Expression]): Seq[FileSplit] + + protected def listLatestBaseFiles(globPaths: Seq[Path], partitionFilters: Seq[Expression], dataFilters: Seq[Expression]): Map[Path, Seq[FileStatus]] = { +if (globPaths.isEmpty) { + val partitionDirs = fileIndex.listFiles(partitionFilters, dataFilters) + partitionDirs.map(pd => (getPartitionPath(pd.files.head), pd.files)).toMap +} else { + val inMemoryFileIndex = HoodieSparkUtils.createInMemoryFileIndex(sparkSession, globPaths) + val partitionDirs = inMemoryFileIndex.listFiles(partitionFilters, dataFilters) + + val fsView = new HoodieTableFileSystemView(metaClient, timeline, partitionDirs.flatMap(_.files).toArray) + val latestBaseFiles = fsView.getLatestBaseFiles.iterator().asScala.toList.map(_.getFileStatus) + + latestBaseFiles.groupBy(getPartitionPath) +} + } + + protected def convertToExpressions(filters: Array[Filter]): Array[Expression] = { 
+val catalystExpressions = HoodieSparkUtils.convertToCatalystExpressions(filters, tableStructSchema) + +val failedExprs = catalystExpressions.zipWithIndex.filter { case (opt, _) => opt.isEmpty } +if (failedExprs.nonEmpty) { + val failedFilters = failedExprs.map(p => filters(p._2)) + logWarning(s"Failed to convert Filters into Catalyst expressions (${failedFilters.map(_.toString)})") +} + +catalystExpressions.filter(_.isDefined).map(_.get).toArray + } + + /** + * Checks whether given expression only references partition columns + * (and involves no sub-query) + */ + protected def isPartitionPredicate(condition: Expression): Boolean = { +// Validates that the provided names both resolve to the same entity +val resolvedNameEquals = sparkSession.sessionState.analyzer.resolver + +condition.references.forall { r => partitionColumns.exists(resolvedNameEquals(r.name, _)) } && + !SubqueryExpression.hasSubquery(condition) + } protected final def appendMandatoryColumns(requestedColumns:
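The mandatory-column rule described in the first hunk above is easy to state as a standalone function: whatever subset of columns the query asks for, the record key (and any other merge-critical columns) are appended so that merging and zero-column queries like count() still work. A hedged sketch, with hypothetical names:

```scala
// Sketch of appendMandatoryColumns: keep the caller's requested columns and
// append any merge-critical columns (e.g. the record key) they left out.
// Extra columns are harmless; Spark prunes them upstream.
def appendMandatoryColumns(requestedColumns: Array[String],
                           mandatoryColumns: Seq[String]): Array[String] =
  requestedColumns ++ mandatoryColumns.filterNot(requestedColumns.contains)

// e.g. a count() pushes down zero columns, but we still fetch the record key:
// appendMandatoryColumns(Array.empty, Seq("_hoodie_record_key"))
//   == Array("_hoodie_record_key")
```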
[jira] [Updated] (HUDI-2606) Ensure query engines not access MDT if disabled
[ https://issues.apache.org/jira/browse/HUDI-2606?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Raymond Xu updated HUDI-2606: - Story Points: 0.5 (was: 0)

> Ensure query engines not access MDT if disabled
> ---
>
> Key: HUDI-2606
> URL: https://issues.apache.org/jira/browse/HUDI-2606
> Project: Apache Hudi
> Issue Type: Task
> Components: metadata, reader-core
> Reporter: sivabalan narayanan
> Assignee: Raymond Xu
> Priority: Blocker
> Labels: pull-request-available
> Fix For: 0.11.0
>

-- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Updated] (HUDI-3535) Integrate with LinkedIn DataHub
[ https://issues.apache.org/jira/browse/HUDI-3535?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Raymond Xu updated HUDI-3535: - Fix Version/s: 0.11.0

> Integrate with LinkedIn DataHub
> ---
>
> Key: HUDI-3535
> URL: https://issues.apache.org/jira/browse/HUDI-3535
> Project: Apache Hudi
> Issue Type: Epic
> Components: meta-sync
> Reporter: Raymond Xu
> Priority: Blocker
> Fix For: 0.11.0
>

-- This message was sent by Atlassian Jira (v8.20.1#820001)