soma1712 opened a new issue, #6249: URL: https://github.com/apache/hudi/issues/6249
**Describe the problem you faced**

We have incoming delta transactions from an Oracle-based application that are pushed to an S3 endpoint using AWS DMS. These CDC records are applied as upserts onto an already-existing Hudi table in a different S3 bucket (initial-load data).

**Environment Description**

* Hudi version :
* Spark version :
* Hive version :
* Hadoop version :
* Storage (HDFS/S3/GCS..) :
* Running on Docker? (yes/no) :
The upserts are performed by running the spark-submit below:

```
spark-submit \
  --deploy-mode client \
  --class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer \
  --conf spark.shuffle.service.enabled=true \
  --conf spark.default.parallelism=500 \
  --conf spark.dynamicAllocation.enabled=true \
  --conf spark.dynamicAllocation.initialExecutors=3 \
  --conf spark.dynamicAllocation.cachedExecutorIdleTimeout=90s \
  --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
  --conf spark.app.name=<table_1> \
  --jars /usr/lib/spark/external/lib/spark-avro.jar,/usr/lib/hive/lib/hbase-client.jar \
  /usr/lib/hudi/hudi-utilities-bundle.jar \
  --table-type MERGE_ON_READ \
  --op UPSERT \
  --hoodie-conf hoodie.datasource.hive_sync.jdbcurl=jdbc:hive2://localhost:10000 \
  --source-ordering-field dms_seq_no \
  --props s3://bucket/cdc.properties \
  --hoodie-conf hoodie.datasource.hive_sync.database=glue_db \
  --target-base-path s3://bucket/table_1 \
  --target-table table_1 \
  --transformer-class org.apache.hudi.utilities.transform.SqlQueryBasedTransformer \
  --hoodie-conf hoodie.deltastreamer.source.dfs.root=s3://bucket/ \
  --source-class org.apache.hudi.utilities.sources.ParquetDFSSource \
  --enable-sync
```

This table `<table_1>` is subsequently read with Hudi options and joined with other Hudi tables to populate the final enriched layer. While reading the Hudi table we are facing a `java.lang.ArrayIndexOutOfBoundsException`. Below are the Hudi properties and the spark-submit we execute to read the table and populate the downstream layer.
Hudi properties:

```
hoodie.datasource.write.partitionpath.field=
hoodie.datasource.write.keygenerator.class=org.apache.hudi.keygen.NonpartitionedKeyGenerator
hoodie.datasource.hive_sync.enable=true
hoodie.datasource.hive_sync.assume_date_partitioning=false
hoodie.datasource.hive_sync.partition_extractor_class=org.apache.hudi.hive.NonPartitionedExtractor
hoodie.parquet.small.file.limit=134217728
hoodie.parquet.max.file.size=1048576000
hoodie.cleaner.policy=KEEP_LATEST_COMMITS
hoodie.cleaner.commits.retained=1
hoodie.deltastreamer.transformer.sql=select CASE WHEN Op='D' THEN TRUE ELSE FALSE END AS _hoodie_is_deleted,* from <SRC>
hoodie.datasource.hive_sync.support_timestamp=true
hoodie.datasource.compaction.async.enable=true
hoodie.index.type=BLOOM
hoodie.compact.inline=true
hoodiecompactionconfig.INLINE_COMPACT_NUM_DELTA_COMMITS_PROP=5
hoodie.metadata.compact.max.delta.commits=5
hoodie.clean.automatic=true
hoodie.clean.async=true
hoodie.datasource.hive_sync.table=table_1
hoodie.datasource.write.recordkey.field=table_1_ID
```

Read-side spark-submit:

```
spark-submit \
  --deploy-mode client \
  --conf spark.yarn.appMasterEnv.SPARK_HOME=/prod/null \
  --conf spark.executorEnv.SPARK_HOME=/prod/null \
  --conf spark.shuffle.service.enabled=true \
  --jars /usr/lib/hudi/hudi-spark-bundle.jar,/usr/lib/spark/external/lib/spark-avro.jar \
  s3://pythonscripts/hudi_read.py
```

**Stacktrace**

```
TaskSetManager: Lost task 32.2 in stage 6.0 (TID 253) on ip-172-31-16-236.ec2.internal, executor 1: java.lang.ArrayIndexOutOfBoundsException (null) [duplicate 1]
22/07/21 15:50:26 INFO TaskSetManager: Starting task 32.3 in stage 6.0 (TID 296, ip-172-31-16-236.ec2.internal, executor 1, partition 32, PROCESS_LOCAL, 8887 bytes)
22/07/21 15:50:26 INFO TaskSetManager: Lost task 33.2 in stage 6.0 (TID 256) on ip-172-31-16-236.ec2.internal, executor 1: java.lang.ArrayIndexOutOfBoundsException (null) [duplicate 2]
```
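For context, the `SqlQueryBasedTransformer` SQL in the properties above turns AWS DMS change records into Hudi soft deletes: any record whose DMS `Op` column is `'D'` gets `_hoodie_is_deleted = TRUE`. The mapping is equivalent to this small illustrative helper (the sample ops are only an illustration, not data from the issue):

```python
def to_hoodie_is_deleted(op):
    """Python equivalent of the transformer's CASE expression:
    CASE WHEN Op='D' THEN TRUE ELSE FALSE END AS _hoodie_is_deleted."""
    return op == "D"

# AWS DMS tags each CDC row with Op = 'I' (insert), 'U' (update) or 'D' (delete).
sample_ops = ["I", "U", "D"]
print([to_hoodie_is_deleted(op) for op in sample_ops])  # → [False, False, True]
```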
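The read script `s3://pythonscripts/hudi_read.py` is not shown in the issue. A minimal hypothetical sketch of what it might look like, assuming the table paths and the `table_1_ID` join key from the configs above (the second table path is an assumption):

```python
# Hypothetical sketch of hudi_read.py; not the reporter's actual script.
# Snapshot reads of a MERGE_ON_READ table merge base parquet files with
# delta log files at query time, which is the read path where the
# ArrayIndexOutOfBoundsException is reported to surface.
HUDI_READ_OPTIONS = {
    "hoodie.datasource.query.type": "snapshot",
}

def build_enriched(spark):
    # 'spark' is an existing SparkSession with the Hudi bundle on the classpath.
    table_1 = (spark.read.format("hudi")
               .options(**HUDI_READ_OPTIONS)
               .load("s3://bucket/table_1"))
    other = (spark.read.format("hudi")
             .options(**HUDI_READ_OPTIONS)
             .load("s3://bucket/table_2"))  # assumed second Hudi table
    # Join on the record key configured for table_1.
    return table_1.join(other, "table_1_ID", "left")
```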
