haoxie-aws opened a new issue, #10107:
URL: https://github.com/apache/hudi/issues/10107

   **Describe the problem you faced**
   
   I'm using AWS Glue to create a copy-on-write table and using Athena to query 
it. Sometimes my query fails with exception like this:
   
   > HIVE_UNKNOWN_ERROR: 
io.trino.hdfs.s3.TrinoS3FileSystem$UnrecoverableS3OperationException: 
com.amazonaws.services.s3.model.AmazonS3Exception: The specified key does not 
exist. (Service: Amazon S3; Status Code: 404; Error Code: NoSuchKey; Request 
ID: V3ENMZPB79R2VT63; S3 Extended Request ID: 
4h1CkwOW74cIHfDEVy1IB33aOxX+LjwV15ekv9kWxXfpqTYAAR6DZMOpFOd9uVimMWwlSudbLuQ=; 
Proxy: null), S3 Extended Request ID: 
4h1CkwOW74cIHfDEVy1IB33aOxX+LjwV15ekv9kWxXfpqTYAAR6DZMOpFOd9uVimMWwlSudbLuQ= 
(Bucket: <****>, Key: SampleHudi/.hoodie/20231115194414497.replacecommit)
   > This query ran against the "" database, unless qualified by the query. Please post the error message on our [forum](https://forums.aws.amazon.com/forum.jspa?forumID=242&start=0) or contact [customer support](https://us-west-2.console.aws.amazon.com/support/home?#/case/create?issueType=technical&serviceCode=amazon-athena&categoryCode=query-related-issue) with Query Id: f8dc7f64-7be6-46cc-a5a1-461ac283daca
   
   I checked the S3 object versions and found that the object was deleted while the query was running. My hypothesis is that the missing replacecommit file was archived by my writer, which runs in parallel with the query. Is this a known failure mode, and is there a way to prevent such failures?
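
   My mental model of the archival trigger (a simplified sketch of my understanding, not Hudi's actual implementation, which also accounts for the cleaner, savepoints, and inflight instants): once the number of completed instants exceeds `hoodie.keep.max.commits`, the oldest instants are moved off the active timeline until only `hoodie.keep.min.commits` remain. With my writer's settings (min 11 / max 12), the oldest commit a query may have planned against becomes eligible for archival almost immediately:

   ```scala
   // Simplified model of Hudi timeline archival. Assumption: the real logic
   // also considers the cleaner, savepoints, and inflight instants.
   object ArchivalSketch {
     // Given completed instant timestamps sorted ascending, return the
     // instants that become eligible for archival.
     def instantsToArchive(completed: Seq[String], minToKeep: Int, maxToKeep: Int): Seq[String] =
       if (completed.size > maxToKeep) completed.dropRight(minToKeep)
       else Seq.empty

     def main(args: Array[String]): Unit = {
       // 13 completed commits with min=11, max=12 (my writer's settings):
       // the 2 oldest become eligible, so a reader that planned its scan
       // against one of them races with the writer deleting the .hoodie file.
       val instants = (1 to 13).map(i => f"202311151944$i%02d")
       println(instantsToArchive(instants, minToKeep = 11, maxToKeep = 12))
     }
   }
   ```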
   
   My writer uses Hudi 0.11.0 and runs in AWS Glue. I have included sample writer code (see the "Additional context" section) that reproduces the issue.
   
   There is also a related conversation in Slack: https://apache-hudi.slack.com/archives/C4D716NPQ/p1699577638995159
   
   **To Reproduce**
   
   Steps to reproduce the behavior:
   
   1. Run the sample writer code in AWS Glue.
   2. In parallel, run this query every 5 seconds in Athena: `select * from samplehudi as t1, samplehudi as t2 where t1.key = t2.key`.
   3. Check the status of the Athena queries. I managed to reproduce the "NoSuchKey" error about once per 1000 queries.
   
   **Expected behavior**
   
   A query should not fail because a commit it has read was archived mid-query.
   
   **Environment Description**
   
   * Hudi version : 0.11.0
   
   * Spark version : 3.1
   
   * Hive version : AWS Glue Catalog
   
   * Hadoop version : 
   
   * Storage (HDFS/S3/GCS..) : S3
   
   * Running on Docker? (yes/no) : No
   
   
   **Additional context**
   
   Sample writer code:
   ```scala
   import com.amazonaws.services.glue.GlueContext
   import org.apache.hudi.DataSourceWriteOptions
   import org.apache.hudi.DataSourceWriteOptions._
   import org.apache.hudi.common.config.HoodieMetadataConfig
   import org.apache.hudi.config.HoodieClusteringConfig._
   import org.apache.hudi.config.HoodieCompactionConfig._
   import org.apache.hudi.config.HoodieStorageConfig.PARQUET_MAX_FILE_SIZE
   import org.apache.hudi.config.HoodieWriteConfig.{BULK_INSERT_SORT_MODE, 
TBL_NAME}
   import org.apache.spark.rdd.RDD
   import org.apache.spark.sql.types.{StringType, StructField, StructType}
   import org.apache.spark.sql.{DataFrame, Row, SaveMode}
   import org.apache.spark.{SparkConf, SparkContext}
   
   import java.time.Instant
   import java.util
   import java.util.UUID
   import scala.collection.JavaConverters
   
   
   object GlueApp {
     val sparkContext: SparkContext = new SparkContext(new SparkConf())
     val glueContext = new GlueContext(sparkContext)
     val schema: StructType = StructType(Array(
       StructField("key", StringType, false),
       StructField("value", StringType, true),
       StructField("ts", StringType, true)
     ))
     val tableName = "SampleHudi"
   
     def main(sysArgs: Array[String]): Unit = {
       for (_ <- 0 to 1000) {
          loop()
       }
     }
   
     def generateDataFrame: DataFrame = {
       val data: java.util.List[Row] = new util.ArrayList[Row]
       for (i <- 0 to 100000) {
         data.add(Row(UUID.randomUUID().toString, i.toString, 
Instant.now().toString))
       }
   
       val rdd: RDD[Row] = glueContext.sparkContext.parallelize(
         JavaConverters.asScalaIteratorConverter(data.iterator()).asScala.toSeq
       )
       glueContext.sparkSession.createDataFrame(rdd, schema)
     }
   
     def loop(): Unit = {
       generateDataFrame.write
         .format("org.apache.hudi")
         .option(TBL_NAME.key(), tableName)
         .option(DataSourceWriteOptions.TABLE_TYPE.key(), 
DataSourceWriteOptions.COW_TABLE_TYPE_OPT_VAL)
         .option(RECORDKEY_FIELD.key(), "key")
         .option(HIVE_SYNC_ENABLED.key(), "true")
         .option(HIVE_DATABASE.key(), "default")
         .option(HIVE_TABLE.key(), tableName)
         .option(HIVE_TABLE_PROPERTIES.key(), tableName)
         .option(HIVE_USE_JDBC.key(), "false")
         .option(HIVE_SUPPORT_TIMESTAMP_TYPE.key(), "true")
         .option(OPERATION.key(), UPSERT_OPERATION_OPT_VAL)
         .option(PARQUET_MAX_FILE_SIZE.key(), "20971520") // 20MB
         .option(PARQUET_SMALL_FILE_LIMIT.key(), "0")
         .option(MIN_COMMITS_TO_KEEP.key(), "11")
         .option(MAX_COMMITS_TO_KEEP.key(), "12")
         .option(BULK_INSERT_SORT_MODE.key(), "NONE")
         .option(INLINE_CLUSTERING.key(), "true")
         .option(INLINE_CLUSTERING_MAX_COMMITS.key(), "2")
         .option(PLAN_STRATEGY_SMALL_FILE_LIMIT.key(), "20971520") // 20MB
         .option(PLAN_STRATEGY_TARGET_FILE_MAX_BYTES.key(), "31457280") // 30MB
         .option(HoodieMetadataConfig.ENABLE.key(), "true")
         .mode(SaveMode.Append)
         .save(s"s3://<****>/$tableName")
     }
   }
   ```
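
   One mitigation I am considering (an assumption on my part, not a confirmed fix): keep many more commits on the active timeline so that a commit a concurrent Athena query has planned against survives much longer before archival. The keys below are real Hudi configs; the values are illustrative:

   ```scala
   // Hedged sketch (assumption, not a verified fix): widen the active-timeline
   // retention window so a commit read by a concurrent query is archived later.
   val retentionOpts: Map[String, String] = Map(
     "hoodie.keep.min.commits" -> "50",        // was 11 in the sample writer
     "hoodie.keep.max.commits" -> "60",        // was 12
     "hoodie.cleaner.commits.retained" -> "45" // must stay below keep.min.commits
   )
   // Could be applied in loop() via: generateDataFrame.write.options(retentionOpts)
   ```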
   
   
   **Stacktrace**
   
   The full error message is in the description above; no stacktrace is available.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
