soma1712 opened a new issue, #6249:
URL: https://github.com/apache/hudi/issues/6249

   **Describe the problem you faced**
   
   Reading a Hudi MERGE_ON_READ table (written via HoodieDeltaStreamer upserts from AWS DMS CDC data) fails with java.lang.ArrayIndexOutOfBoundsException.
   
   **To Reproduce**
   
   Steps to reproduce the behavior:
   
   1. Push Oracle CDC records to an S3 endpoint using AWS DMS.
   2. Apply the CDC records as upserts onto an existing MERGE_ON_READ Hudi table with HoodieDeltaStreamer (spark-submit under Detailed Notes).
   3. Read the resulting table with Hudi options from a PySpark job.
   4. Tasks fail repeatedly with java.lang.ArrayIndexOutOfBoundsException.
   
   **Expected behavior**
   
   The Hudi table can be read and joined downstream without task failures.
   
   **Environment Description**
   
   * Hudi version :
   
   * Spark version :
   
   * Hive version :
   
   * Hadoop version :
   
   * Storage (HDFS/S3/GCS..) :
   
   * Running on Docker? (yes/no) :
   
   
   **Additional context**
   
   See the Detailed Notes below.
   
   **Stacktrace**
   
   See the log excerpt at the end of the Detailed Notes below.
   
   Detailed Notes - 
   
   We have incoming delta transactions from an Oracle-based application that are pushed to an S3 endpoint using AWS DMS. These CDC records are applied as upserts onto an already existing Hudi table (initial-load data) in a different S3 bucket. The upserts run via the spark-submit below:
   
   spark-submit \
   --deploy-mode client \
   --class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer \
   --conf spark.shuffle.service.enabled=true \
   --conf spark.default.parallelism=500 \
   --conf spark.dynamicAllocation.enabled=true \
   --conf spark.dynamicAllocation.initialExecutors=3 \
   --conf spark.dynamicAllocation.cachedExecutorIdleTimeout=90s \
   --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
   --conf spark.app.name=<table_1> \
   --jars /usr/lib/spark/external/lib/spark-avro.jar,/usr/lib/hive/lib/hbase-client.jar \
   /usr/lib/hudi/hudi-utilities-bundle.jar \
   --table-type MERGE_ON_READ \
   --op UPSERT \
   --hoodie-conf hoodie.datasource.hive_sync.jdbcurl=jdbc:hive2://localhost:10000 \
   --source-ordering-field dms_seq_no \
   --props s3://bucket/cdc.properties \
   --hoodie-conf hoodie.datasource.hive_sync.database=glue_db \
   --target-base-path s3://bucket/table_1 \
   --target-table table_1 \
   --transformer-class org.apache.hudi.utilities.transform.SqlQueryBasedTransformer \
   --hoodie-conf hoodie.deltastreamer.source.dfs.root=s3://bucket/ \
   --source-class org.apache.hudi.utilities.sources.ParquetDFSSource \
   --enable-sync
   
   This table (table_1) is subsequently read with Hudi options and joined with other Hudi tables to populate the final enriched layer. While reading the Hudi table we are hitting a java.lang.ArrayIndexOutOfBoundsException.
   
   Below are the Hudi properties and the spark-submit we execute to read the table and populate the downstream layer.
   
   hoodie.datasource.write.partitionpath.field=
   hoodie.datasource.write.keygenerator.class=org.apache.hudi.keygen.NonpartitionedKeyGenerator
   hoodie.datasource.hive_sync.enable=true
   hoodie.datasource.hive_sync.assume_date_partitioning=false
   hoodie.datasource.hive_sync.partition_extractor_class=org.apache.hudi.hive.NonPartitionedExtractor
   hoodie.parquet.small.file.limit=134217728
   hoodie.parquet.max.file.size=1048576000
   hoodie.cleaner.policy=KEEP_LATEST_COMMITS
   hoodie.cleaner.commits.retained=1
   hoodie.deltastreamer.transformer.sql=select CASE WHEN Op='D' THEN TRUE ELSE FALSE END AS _hoodie_is_deleted,* from <SRC>
   hoodie.datasource.hive_sync.support_timestamp=true
   hoodie.datasource.compaction.async.enable=true
   hoodie.index.type=BLOOM
   hoodie.compact.inline=true
   hoodie.compact.inline.max.delta.commits=5
   hoodie.metadata.compact.max.delta.commits=5
   hoodie.clean.automatic=true
   hoodie.clean.async=true
   hoodie.datasource.hive_sync.table=table_1
   hoodie.datasource.write.recordkey.field=table_1_ID
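   
   The hoodie.deltastreamer.transformer.sql property above is what flags DMS deletes for Hudi: rows with Op='D' get _hoodie_is_deleted=true, and Hudi drops such records on merge. A minimal PySpark sketch of that same mapping, with made-up sample rows, purely to illustrate what the transformer computes:
   
   ```python
   # Illustration only: mimics the SqlQueryBasedTransformer SQL above.
   # AWS DMS marks deletes with Op = 'D'; Hudi soft-deletes any record
   # whose _hoodie_is_deleted column evaluates to true.
   from pyspark.sql import SparkSession, functions as F
   
   spark = SparkSession.builder.appName("delete-flag-demo").getOrCreate()
   
   # Hypothetical CDC rows; the real data comes from the DMS parquet files.
   cdc = spark.createDataFrame(
       [("I", 1), ("U", 2), ("D", 3)],
       ["Op", "table_1_ID"],
   )
   
   flagged = cdc.withColumn("_hoodie_is_deleted", F.col("Op") == F.lit("D"))
   flagged.show()  # only the row with Op = 'D' carries the delete flag
   ```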
   
   spark-submit \
   --deploy-mode client \
   --conf spark.yarn.appMasterEnv.SPARK_HOME=/prod/null \
   --conf spark.executorEnv.SPARK_HOME=/prod/null \
   --conf spark.shuffle.service.enabled=true \
   --jars /usr/lib/hudi/hudi-spark-bundle.jar,/usr/lib/spark/external/lib/spark-avro.jar \
   s3://pythonscripts/hudi_read.py
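   
   The contents of hudi_read.py are not reproduced here; a minimal sketch of the kind of read it performs, assuming a plain snapshot query on the MOR table (the path, options, and trailing action are our assumptions, not the actual script):
   
   ```python
   # Minimal sketch, assuming hudi_read.py performs a snapshot read of
   # the MERGE_ON_READ table; the path and final action are placeholders.
   from pyspark.sql import SparkSession
   
   spark = (
       SparkSession.builder
       .appName("hudi_read")
       .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
       .getOrCreate()
   )
   
   table_1 = (
       spark.read.format("hudi")
       # snapshot is the default query type; stated here for clarity
       .option("hoodie.datasource.query.type", "snapshot")
       .load("s3://bucket/table_1")
   )
   
   # The ArrayIndexOutOfBoundsException surfaces while this read (and the
   # joins into the enriched layer) executes.
   table_1.show(10)
   ```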
   
   
   
   Log excerpt from the failing read:
   
   ```
   TaskSetManager: Lost task 32.2 in stage 6.0 (TID 253) on ip-172-31-16-236.ec2.internal, executor 1: java.lang.ArrayIndexOutOfBoundsException (null) [duplicate 1]
   22/07/21 15:50:26 INFO TaskSetManager: Starting task 32.3 in stage 6.0 (TID 296, ip-172-31-16-236.ec2.internal, executor 1, partition 32, PROCESS_LOCAL, 8887 bytes)
   22/07/21 15:50:26 INFO TaskSetManager: Lost task 33.2 in stage 6.0 (TID 256) on ip-172-31-16-236.ec2.internal, executor 1: java.lang.ArrayIndexOutOfBoundsException (null) [duplicate 2]
   ```
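   
   One possible way to isolate the failure (not something we have verified): re-run the same read as a read-optimized query, which serves only the compacted base files of the MOR table and skips the log-file merge. If that read succeeds while the snapshot read fails, the exception likely originates in merging the delta log files. Reusing the session from the sketch above:
   
   ```python
   # Untested diagnostic sketch: a read-optimized query skips the delta
   # log files of a MERGE_ON_READ table entirely, reading base files only.
   read_optimized = (
       spark.read.format("hudi")
       .option("hoodie.datasource.query.type", "read_optimized")
       .load("s3://bucket/table_1")
   )
   read_optimized.count()
   ```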
   
   

