haoxie-aws opened a new issue, #10107: URL: https://github.com/apache/hudi/issues/10107
**_Tips before filing an issue_**

- Have you gone through our [FAQs](https://hudi.apache.org/learn/faq/)?
- Join the mailing list to engage in conversations and get faster support at [email protected].
- If you have triaged this as a bug, then file an [issue](https://issues.apache.org/jira/projects/HUDI/issues) directly.

**Describe the problem you faced**

I'm using AWS Glue to create a copy-on-write table and Athena to query it. Occasionally a query fails with an exception like this:

> HIVE_UNKNOWN_ERROR: io.trino.hdfs.s3.TrinoS3FileSystem$UnrecoverableS3OperationException: com.amazonaws.services.s3.model.AmazonS3Exception: The specified key does not exist. (Service: Amazon S3; Status Code: 404; Error Code: NoSuchKey; Request ID: V3ENMZPB79R2VT63; S3 Extended Request ID: 4h1CkwOW74cIHfDEVy1IB33aOxX+LjwV15ekv9kWxXfpqTYAAR6DZMOpFOd9uVimMWwlSudbLuQ=; Proxy: null), S3 Extended Request ID: 4h1CkwOW74cIHfDEVy1IB33aOxX+LjwV15ekv9kWxXfpqTYAAR6DZMOpFOd9uVimMWwlSudbLuQ= (Bucket: <****>, Key: SampleHudi/.hoodie/20231115194414497.replacecommit)
>
> This query ran against the "" database, unless qualified by the query. Please post the error message on our [forum](https://forums.aws.amazon.com/forum.jspa?forumID=242&start=0) or contact [customer support](https://us-west-2.console.aws.amazon.com/support/home?#/case/create?issueType=technical&serviceCode=amazon-athena&categoryCode=query-related-issue) with Query Id: f8dc7f64-7be6-46cc-a5a1-461ac283daca

I checked the S3 object's version history and found that the object was deleted while the query was running. My hypothesis is that the missing replacecommit file was archived by my writer, which runs in parallel with my queries. Is this a known failure mode? Is there a way to prevent such failures? My writer uses Hudi 0.11.0 and runs in AWS Glue. I have included sample writer code (in the "Additional context" section) that reproduces the issue.
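One mitigation I'm considering (an assumption on my part, not a confirmed fix) is widening the cleaner/archival window, so that the oldest instant still on the active timeline is much older than any in-flight query. The key names below are the standard Hudi config keys; the specific values are purely illustrative:

```scala
// Hedged workaround sketch: keep more instants on the active timeline so a
// reader that resolved the timeline a few seconds ago still finds its commit
// files. Values are illustrative, not tuned recommendations.
object ArchivalWindowSketch {
  val widerRetention: Map[String, String] = Map(
    "hoodie.cleaner.commits.retained" -> "25", // must stay below keep.min.commits
    "hoodie.keep.min.commits"         -> "30", // the repro writer below uses 11
    "hoodie.keep.max.commits"         -> "35"  // the repro writer below uses 12
  )

  def main(args: Array[String]): Unit =
    widerRetention.foreach { case (k, v) => println(s"$k=$v") }
}
```

These could be passed as extra `.option(key, value)` calls on the writer; whether a larger window fully closes the race, or only makes it rarer, is exactly what I'd like confirmed.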
There is also a conversation in Slack: https://apache-hudi.slack.com/archives/C4D716NPQ/p1699577638995159

**To Reproduce**

Steps to reproduce the behavior:

1. Run the sample writer code in AWS Glue.
2. In parallel, run this query every 5 s in Athena: `select * from samplehudi as t1, samplehudi as t2 where t1.key = t2.key`.
3. Check the status of the Athena queries. I managed to reproduce the "NoSuchKey" error roughly once per 1000 queries.

**Expected behavior**

A query should not fail when a commit it has already resolved is archived.

**Environment Description**

* Hudi version : 0.11.0
* Spark version : 3.1
* Hive version : AWS Glue Catalog
* Hadoop version :
* Storage (HDFS/S3/GCS..) : S3
* Running on Docker? (yes/no) : No

**Additional context**

Sample writer code:

```scala
import com.amazonaws.services.glue.GlueContext
import org.apache.hudi.DataSourceWriteOptions
import org.apache.hudi.DataSourceWriteOptions._
import org.apache.hudi.common.config.HoodieMetadataConfig
import org.apache.hudi.config.HoodieClusteringConfig._
import org.apache.hudi.config.HoodieCompactionConfig._
import org.apache.hudi.config.HoodieStorageConfig.PARQUET_MAX_FILE_SIZE
import org.apache.hudi.config.HoodieWriteConfig.{BULK_INSERT_SORT_MODE, TBL_NAME}
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.types.{StringType, StructField, StructType}
import org.apache.spark.sql.{DataFrame, Row, SaveMode}
import org.apache.spark.{SparkConf, SparkContext}

import java.time.Instant
import java.util
import java.util.UUID
import scala.collection.JavaConverters

object GlueApp {
  val sparkContext: SparkContext = new SparkContext(new SparkConf())
  val glueContext = new GlueContext(sparkContext)

  // Three-column schema: a random UUID key, a numeric payload, and a timestamp.
  val schema: StructType = StructType(Array(
    StructField("key", StringType, nullable = false),
    StructField("value", StringType, nullable = true),
    StructField("ts", StringType, nullable = true)
  ))
  val tableName = "SampleHudi"

  def main(sysArgs: Array[String]): Unit = {
    for (_ <- 0 to 1000) {
      loop()
    }
  }

  def generateDataFrame: DataFrame = {
    val data: java.util.List[Row] = new util.ArrayList[Row]
    for (i <- 0 to 100000) {
      data.add(Row(UUID.randomUUID().toString, i.toString, Instant.now().toString))
    }
    val rdd: RDD[Row] = glueContext.sparkContext.parallelize(
      JavaConverters.asScalaIteratorConverter(data.iterator()).asScala.toSeq
    )
    glueContext.sparkSession.createDataFrame(rdd, schema)
  }

  def loop(): Unit = {
    generateDataFrame.write
      .format("org.apache.hudi")
      .option(TBL_NAME.key(), tableName)
      .option(DataSourceWriteOptions.TABLE_TYPE.key(), DataSourceWriteOptions.COW_TABLE_TYPE_OPT_VAL)
      .option(RECORDKEY_FIELD.key(), "key")
      .option(HIVE_SYNC_ENABLED.key(), "true")
      .option(HIVE_DATABASE.key(), "default")
      .option(HIVE_TABLE.key(), tableName)
      .option(HIVE_TABLE_PROPERTIES.key(), tableName)
      .option(HIVE_USE_JDBC.key(), "false")
      .option(HIVE_SUPPORT_TIMESTAMP_TYPE.key(), "true")
      .option(OPERATION.key(), UPSERT_OPERATION_OPT_VAL)
      .option(PARQUET_MAX_FILE_SIZE.key(), "20971520") // 20 MB
      .option(PARQUET_SMALL_FILE_LIMIT.key(), "0")
      .option(MIN_COMMITS_TO_KEEP.key(), "11")
      .option(MAX_COMMITS_TO_KEEP.key(), "12")
      .option(BULK_INSERT_SORT_MODE.key(), "NONE")
      .option(INLINE_CLUSTERING.key(), "true")
      .option(INLINE_CLUSTERING_MAX_COMMITS.key(), "2")
      .option(PLAN_STRATEGY_SMALL_FILE_LIMIT.key(), "20971520")      // 20 MB
      .option(PLAN_STRATEGY_TARGET_FILE_MAX_BYTES.key(), "31457280") // 30 MB
      .option(HoodieMetadataConfig.ENABLE.key(), "true")
      .mode(SaveMode.Append)
      .save(s"s3://<****>/$tableName")
  }
}
```

**Stacktrace**

The full error message is in the description; no stacktrace is available.
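For context on why the writer config above archives so aggressively: with `MIN_COMMITS_TO_KEEP`/`MAX_COMMITS_TO_KEEP` at 11/12 and inline clustering every 2 commits, archival trims the active timeline very frequently. This sketch (my own illustration, not Hudi's actual archiver code) shows the trimming behavior I believe races with the query:

```scala
// Illustrative sketch of Hudi timeline trimming: instant files under .hoodie/
// are named <timestamp>.<action>, e.g. "20231115194414497.replacecommit".
// Once the active timeline exceeds the max-commits-to-keep threshold, the
// archiver trims it back to min-commits-to-keep, deleting the oldest instant
// files -- which is exactly the 404 the Athena query hits.
object TimelineTrimSketch {
  // Split "20231115194414497.replacecommit" into (timestamp, action).
  def parseInstant(fileName: String): (String, String) = {
    val Array(ts, action) = fileName.split("\\.", 2)
    (ts, action)
  }

  // Instant timestamps (yyyyMMddHHmmssSSS) sort lexicographically, so victims
  // can be picked by plain string order: once more than maxToKeep instants
  // exist, everything but the newest minToKeep is archived.
  def toArchive(instants: Seq[String], minToKeep: Int, maxToKeep: Int): Seq[String] =
    if (instants.size <= maxToKeep) Seq.empty
    else instants.sorted.dropRight(minToKeep)

  def main(args: Array[String]): Unit = {
    val instants = (1 to 13).map(i => f"202311151900$i%02d000.commit")
    println(toArchive(instants, minToKeep = 11, maxToKeep = 12))
  }
}
```

If this mental model is right, any query that resolves the timeline just before a trim can chase a deleted instant file, which matches the once-per-~1000-queries failure rate.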
