Limess opened a new issue #4043:
URL: https://github.com/apache/hudi/issues/4043


   **Describe the problem you faced**
   
   We've been successfully writing to a Hudi table for a couple of weeks with 
the following processes:
   
   1. A deltastreamer instance that runs hourly and reads from parquet written 
by AWS Kinesis Firehose, using checkpointing
   2. A separate deltastreamer instance that runs nightly and backfills 
updates across the end table
   
   We made a change to support deletions in (2), using `_hoodie_deleted_date`; 
this resulted in the column `_hoodie_deleted_date` being added to the end-table 
schema.
   
   At this point, our writer from (1) started failing with the following 
stacktrace:
   
   ```
   java.lang.ClassCastException: java.lang.String cannot be cast to 
org.apache.spark.sql.Row
   at 
org.apache.hudi.AvroConversionHelper$.$anonfun$createConverterToAvro$15(AvroConversionHelper.scala:358)
   at 
org.apache.hudi.AvroConversionHelper$.$anonfun$createConverterToAvro$15(AvroConversionHelper.scala:362)
   at 
org.apache.hudi.HoodieSparkUtils$.$anonfun$createRddInternal$3(HoodieSparkUtils.scala:138)
   at scala.collection.Iterator$$anon$10.next(Iterator.scala:459)
   at scala.collection.Iterator$SliceIterator.next(Iterator.scala:271)
   at scala.collection.Iterator.foreach(Iterator.scala:941)
   at scala.collection.Iterator.foreach$(Iterator.scala:941)
   at scala.collection.AbstractIterator.foreach(Iterator.scala:1429)
   at scala.collection.generic.Growable.$plus$plus$eq(Growable.scala:62)
   at scala.collection.generic.Growable.$plus$plus$eq$(Growable.scala:53)
   at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:105)
   at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:49)
   at scala.collection.TraversableOnce.to(TraversableOnce.scala:315)
   at scala.collection.TraversableOnce.to$(TraversableOnce.scala:313)
   at scala.collection.AbstractIterator.to(Iterator.scala:1429)
   at scala.collection.TraversableOnce.toBuffer(TraversableOnce.scala:307)
   at scala.collection.TraversableOnce.toBuffer$(TraversableOnce.scala:307)
   at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1429)
   at scala.collection.TraversableOnce.toArray(TraversableOnce.scala:294)
   at scala.collection.TraversableOnce.toArray$(TraversableOnce.scala:288)
   at scala.collection.AbstractIterator.toArray(Iterator.scala:1429)
   at org.apache.spark.rdd.RDD.$anonfun$take$2(RDD.scala:1449)
   at org.apache.spark.SparkContext.$anonfun$runJob$5(SparkContext.scala:2281)
   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
   at org.apache.spark.scheduler.Task.run(Task.scala:131)
   at 
org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:497)
   at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1439)
   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:500)
   at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
   at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
   at java.lang.Thread.run(Thread.java:748) 
   ```
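   As a hedged, pure-Python illustration (NOT Hudi internals), one way a `ClassCastException` like the above can arise: the Avro converters in `AvroConversionHelper` are built per-field from the target schema, so if an incoming record carries fields in a different layout than the (reconciled) target schema expects, a converter built for a nested struct can receive a plain string. The field names and ordering below are hypothetical, chosen only to mimic the symptom.
   
   ```python
   # Hypothetical field order after reconciliation: id, deleted date, a struct.
   class Row:  # stand-in for org.apache.spark.sql.Row
       def __init__(self, *values):
           self.values = values
   
   def struct_converter(value):
       # Converter built for a nested struct field; anything else "cannot be cast".
       if not isinstance(value, Row):
           raise TypeError(f"{type(value).__name__} cannot be cast to Row")
       return tuple(value.values)
   
   converters = [str, str, struct_converter]  # id, _hoodie_deleted_date, location
   # Incoming record where the new date column landed *after* the struct:
   record = ["abc123", Row("GB", "England", "London"), "2021-11-19"]
   
   try:
       converted = [convert(v) for convert, v in zip(converters, record)]
   except TypeError as exc:
       print(exc)  # prints: str cannot be cast to Row
   ```
   
   The positional misalignment hands the string `"2021-11-19"` to the struct converter, the Python analogue of `java.lang.String cannot be cast to org.apache.spark.sql.Row`.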
   
   ## Things we've tried:
   
   1. Adding the column to AWS Glue so that it is written out in new records 
written by Kinesis Firehose (as a `null` value), and removing all data without 
the column
   2. Downgrading Apache Spark to 3.0.1
   3. Rewriting the source parquet using Spark and using this as input (in case 
there is some weirdness in the schema or encoding)
   4. Unsetting `hoodie.datasource.write.reconcile.schema=true` after adding 
the null column to records
   
   All of these still produce the same error as above.
    
   **To Reproduce**
   
   Unsure of the steps/root cause.
   
   **Expected behavior**
   
   We'd expect the data to write correctly, even without adding the column in 
the realtime writer (1), as we are setting 
`hoodie.datasource.write.reconcile.schema=true`.
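   Conceptually (a sketch only; Hudi's actual reconciliation rewrites Avro schemas, this is not its implementation), we understood `reconcile.schema` to mean that incoming records get projected onto the table schema, with columns the writer does not supply filled as null:
   
   ```python
   # Conceptual sketch of schema reconciliation: project an incoming record
   # onto the table schema, null-filling columns the writer does not supply.
   def reconcile(record: dict, table_schema: list) -> dict:
       return {col: record.get(col) for col in table_schema}
   
   # Writer (1) does not know about the new column added by writer (2):
   table_schema = ["id", "version", "_hoodie_deleted_date"]
   incoming = {"id": "abc123", "version": 7}
   
   print(reconcile(incoming, table_schema))
   # {'id': 'abc123', 'version': 7, '_hoodie_deleted_date': None}
   ```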
   
   **Environment Description**
   
   Tested on both EMR 6.4.0 and 6.2.1
   
   * Hudi version: 0.9.0
   
   * Spark version :
   
   Tested with 3.1.2 and 3.0.1
   
   * Hive version :
   
   Hive 3.1.2
   
   * Hadoop version :
   
   Amazon 3.2.1
   
   * Storage (HDFS/S3/GCS..) :
   
   S3
   
   * Running on Docker? (yes/no) :
   
   no
   
   **Stacktrace**
   
   See the stacktrace in the problem description above.
   
   
   ### Additional details
   
   Deltastreamer config:
   
   ```
   21/11/19 12:03:58 INFO HoodieDeltaStreamer: Creating delta streamer with 
configs:
   hoodie.avro.schema.validate: true
   hoodie.bloom.index.prune.by.ranges: false
   hoodie.bulkinsert.shuffle.parallelism: 275
   hoodie.cleaner.commits.retained: 1
   hoodie.cleaner.policy.failed.writes: LAZY
   hoodie.datasource.hive_sync.database: articles
   hoodie.datasource.hive_sync.enable: true
   hoodie.datasource.hive_sync.jdbcurl: jdbc:hive2://10.0.69.218:10000
   hoodie.datasource.hive_sync.partition_extractor_class: 
org.apache.hudi.hive.MultiPartKeysValueExtractor
   hoodie.datasource.hive_sync.partition_fields: story_published_partition_date
   hoodie.datasource.hive_sync.support_timestamp: true
   hoodie.datasource.hive_sync.table: articles_hudi_copy_on_write
   hoodie.datasource.write.drop.partition.columns: false
   hoodie.datasource.write.hive_style_partitioning: true
   hoodie.datasource.write.keygenerator.class: 
org.apache.hudi.keygen.TimestampBasedKeyGenerator
   hoodie.datasource.write.partitionpath.field: story_published_partition_date
   hoodie.datasource.write.precombine.field: version
   hoodie.datasource.write.reconcile.schema: true
   hoodie.datasource.write.recordkey.field: id
   hoodie.deltastreamer.keygen.timebased.input.dateformat: 
yyyy-MM-dd'T'HH:mm:ssZ,yyyy-MM-dd'T'HH:mm:ss.SSSZ
   hoodie.deltastreamer.keygen.timebased.input.dateformat.list.delimiter.regex: 
,
   hoodie.deltastreamer.keygen.timebased.output.dateformat: yyyy-MM-dd
   hoodie.deltastreamer.keygen.timebased.output.timezone: UTC
   hoodie.deltastreamer.keygen.timebased.timestamp.type: DATE_STRING
   hoodie.deltastreamer.source.dfs.root: 
s3://<bucket>/realtime_out_identity_parquet_test/
   hoodie.deltastreamer.transformer.sql.file: 
/etc/hudi/conf/schema/documents_schema.sql
   hoodie.insert.shuffle.parallelism: 275
   hoodie.metrics.on: true
   hoodie.metrics.reporter.type: PROMETHEUS
   hoodie.table.name: articles_hudi_copy_on_write
   hoodie.upsert.shuffle.parallelism: 275
   hoodie.write.concurrency.mode: optimistic_concurrency_control
   hoodie.write.lock.provider: 
org.apache.hudi.client.transaction.lock.ZookeeperBasedLockProvider
   hoodie.write.lock.zookeeper.base_path: /hudi
   hoodie.write.lock.zookeeper.lock_key: articles_hudi_copy_on_write
   hoodie.write.lock.zookeeper.port: 2181
   hoodie.write.lock.zookeeper.url: 10.0.69.218
   hoodie.write.markers.type: TIMELINE_SERVER_BASED
   ```
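   For reference, the `TimestampBasedKeyGenerator` settings above derive the `story_published_partition_date` partition value from a `DATE_STRING` timestamp. A minimal sketch of that behavior (not Hudi's implementation; the `strptime` patterns below are rough Python equivalents of the Java formats `yyyy-MM-dd'T'HH:mm:ssZ` and `yyyy-MM-dd'T'HH:mm:ss.SSSZ`):
   
   ```python
   from datetime import datetime, timezone
   
   # Try each configured input format in order, then render the partition
   # value with the output format in the configured (UTC) output timezone.
   INPUT_FORMATS = ["%Y-%m-%dT%H:%M:%S%z", "%Y-%m-%dT%H:%M:%S.%f%z"]
   OUTPUT_FORMAT = "%Y-%m-%d"
   
   def partition_value(ts: str) -> str:
       for fmt in INPUT_FORMATS:
           try:
               dt = datetime.strptime(ts, fmt)
           except ValueError:
               continue
           return dt.astimezone(timezone.utc).strftime(OUTPUT_FORMAT)
       raise ValueError(f"timestamp {ts!r} matches no configured input format")
   
   print(partition_value("2021-11-19T23:30:00.500-0500"))  # 2021-11-20
   ```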
   
   Schema which results in failures:
   
   ```
   
   ############ file meta data ############
   created_by: parquet-mr version 1.8.1 (build 
4aba4dae7bb0d4edbcf7923ae1339f28fd3f7fcf)
   num_columns: 96
   num_rows: 1709
   num_row_groups: 1
   format_version: 1.0
   serialized_size: 41280
   
   
   ############ Columns ############
   id
   version
   aggregation_id
   area_on_page
   article_cursor
   article_deduplication_id
   article_earliest_published_date
   author
   canonical_source_id
   country
   region
   subregion
   canonical_source_name
   reach_origin
   reach_provider
   source_id
   type
   value
   content
   deleted_date
   document_type
   eclips_web_url
   embargoed_until_date
   offset
   overlapping
   position
   rule_based_entity
   compound
   neg
   neu
   pos
   signal_type
   surface_form
   wiki_title
   salience
   salience_rank
   wiki_title
   feed
   format_version
   ingestion_id
   replay_time
   replay_type
   journalist_id
   journalist_name
   journalistic_quality
   language
   licence_id
   media_type
   array_element
   content
   summary
   title
   old_article_ids
   original_id
   original_source_id
   original_source_name
   original_url
   page
   page_section
   pdf_url
   processed_date
   podcast_link
   copyright
   formatted_content
   nla_publisher
   provider_hosted_url
   publication_time
   end_date
   start_date
   station_id
   asset_id
   partner_id
   published_date
   end
   start
   text
   received
   signal_importance
   shares
   array_element
   story_cursor
   story_id
   story_index
   story_processed_date
   story_published_date
   summary
   summary_origin
   id
   score
   title
   probability
   topic_id
   array_element
   tracking_url
   translated_from
   _hoodie_is_deleted
   
   ############ Column(id) ############
   name: id
   path: id
   max_definition_level: 1
   max_repetition_level: 0
   physical_type: BYTE_ARRAY
   logical_type: String
   converted_type (legacy): UTF8
   
   ############ Column(version) ############
   name: version
   path: version
   max_definition_level: 1
   max_repetition_level: 0
   physical_type: INT64
   logical_type: None
   converted_type (legacy): NONE
   
   ############ Column(aggregation_id) ############
   name: aggregation_id
   path: aggregation_id
   max_definition_level: 1
   max_repetition_level: 0
   physical_type: BYTE_ARRAY
   logical_type: String
   converted_type (legacy): UTF8
   
   ############ Column(area_on_page) ############
   name: area_on_page
   path: area_on_page
   max_definition_level: 1
   max_repetition_level: 0
   physical_type: INT64
   logical_type: None
   converted_type (legacy): NONE
   
   ############ Column(article_cursor) ############
   name: article_cursor
   path: article_cursor
   max_definition_level: 1
   max_repetition_level: 0
   physical_type: BYTE_ARRAY
   logical_type: String
   converted_type (legacy): UTF8
   
   ############ Column(article_deduplication_id) ############
   name: article_deduplication_id
   path: article_deduplication_id
   max_definition_level: 1
   max_repetition_level: 0
   physical_type: BYTE_ARRAY
   logical_type: String
   converted_type (legacy): UTF8
   
   ############ Column(article_earliest_published_date) ############
   name: article_earliest_published_date
   path: article_earliest_published_date
   max_definition_level: 1
   max_repetition_level: 0
   physical_type: BYTE_ARRAY
   logical_type: String
   converted_type (legacy): UTF8
   
   ############ Column(author) ############
   name: author
   path: author
   max_definition_level: 1
   max_repetition_level: 0
   physical_type: BYTE_ARRAY
   logical_type: String
   converted_type (legacy): UTF8
   
   ############ Column(canonical_source_id) ############
   name: canonical_source_id
   path: canonical_source_id
   max_definition_level: 1
   max_repetition_level: 0
   physical_type: BYTE_ARRAY
   logical_type: String
   converted_type (legacy): UTF8
   
   ############ Column(country) ############
   name: country
   path: canonical_source_location.country
   max_definition_level: 2
   max_repetition_level: 0
   physical_type: BYTE_ARRAY
   logical_type: String
   converted_type (legacy): UTF8
   
   ############ Column(region) ############
   name: region
   path: canonical_source_location.region
   max_definition_level: 2
   max_repetition_level: 0
   physical_type: BYTE_ARRAY
   logical_type: String
   converted_type (legacy): UTF8
   
   ############ Column(subregion) ############
   name: subregion
   path: canonical_source_location.subregion
   max_definition_level: 2
   max_repetition_level: 0
   physical_type: BYTE_ARRAY
   logical_type: String
   converted_type (legacy): UTF8
   
   ############ Column(canonical_source_name) ############
   name: canonical_source_name
   path: canonical_source_name
   max_definition_level: 1
   max_repetition_level: 0
   physical_type: BYTE_ARRAY
   logical_type: String
   converted_type (legacy): UTF8
   
   ############ Column(reach_origin) ############
   name: reach_origin
   path: canonical_source_reach.reach_origin
   max_definition_level: 2
   max_repetition_level: 0
   physical_type: BYTE_ARRAY
   logical_type: String
   converted_type (legacy): UTF8
   
   ############ Column(reach_provider) ############
   name: reach_provider
   path: canonical_source_reach.reach_provider
   max_definition_level: 2
   max_repetition_level: 0
   physical_type: BYTE_ARRAY
   logical_type: String
   converted_type (legacy): UTF8
   
   ############ Column(source_id) ############
   name: source_id
   path: canonical_source_reach.source_id
   max_definition_level: 2
   max_repetition_level: 0
   physical_type: BYTE_ARRAY
   logical_type: String
   converted_type (legacy): UTF8
   
   ############ Column(type) ############
   name: type
   path: canonical_source_reach.type
   max_definition_level: 2
   max_repetition_level: 0
   physical_type: BYTE_ARRAY
   logical_type: String
   converted_type (legacy): UTF8
   
   ############ Column(value) ############
   name: value
   path: canonical_source_reach.value
   max_definition_level: 2
   max_repetition_level: 0
   physical_type: INT64
   logical_type: None
   converted_type (legacy): NONE
   
   ############ Column(content) ############
   name: content
   path: content
   max_definition_level: 1
   max_repetition_level: 0
   physical_type: BYTE_ARRAY
   logical_type: String
   converted_type (legacy): UTF8
   
   ############ Column(deleted_date) ############
   name: deleted_date
   path: deleted_date
   max_definition_level: 1
   max_repetition_level: 0
   physical_type: BYTE_ARRAY
   logical_type: String
   converted_type (legacy): UTF8
   
   ############ Column(document_type) ############
   name: document_type
   path: document_type
   max_definition_level: 1
   max_repetition_level: 0
   physical_type: BYTE_ARRAY
   logical_type: String
   converted_type (legacy): UTF8
   
   ############ Column(eclips_web_url) ############
   name: eclips_web_url
   path: eclips_web_url
   max_definition_level: 1
   max_repetition_level: 0
   physical_type: BYTE_ARRAY
   logical_type: String
   converted_type (legacy): UTF8
   
   ############ Column(embargoed_until_date) ############
   name: embargoed_until_date
   path: embargoed_until_date
   max_definition_level: 1
   max_repetition_level: 0
   physical_type: BYTE_ARRAY
   logical_type: String
   converted_type (legacy): UTF8
   
   ############ Column(offset) ############
   name: offset
   path: entities.bag.array_element.offset
   max_definition_level: 4
   max_repetition_level: 1
   physical_type: INT64
   logical_type: None
   converted_type (legacy): NONE
   
   ############ Column(overlapping) ############
   name: overlapping
   path: entities.bag.array_element.overlapping
   max_definition_level: 4
   max_repetition_level: 1
   physical_type: BOOLEAN
   logical_type: None
   converted_type (legacy): NONE
   
   ############ Column(position) ############
   name: position
   path: entities.bag.array_element.position
   max_definition_level: 4
   max_repetition_level: 1
   physical_type: BYTE_ARRAY
   logical_type: String
   converted_type (legacy): UTF8
   
   ############ Column(rule_based_entity) ############
   name: rule_based_entity
   path: entities.bag.array_element.rule_based_entity
   max_definition_level: 4
   max_repetition_level: 1
   physical_type: BOOLEAN
   logical_type: None
   converted_type (legacy): NONE
   
   ############ Column(compound) ############
   name: compound
   path: entities.bag.array_element.sentiment.compound
   max_definition_level: 5
   max_repetition_level: 1
   physical_type: DOUBLE
   logical_type: None
   converted_type (legacy): NONE
   
   ############ Column(neg) ############
   name: neg
   path: entities.bag.array_element.sentiment.neg
   max_definition_level: 5
   max_repetition_level: 1
   physical_type: DOUBLE
   logical_type: None
   converted_type (legacy): NONE
   
   ############ Column(neu) ############
   name: neu
   path: entities.bag.array_element.sentiment.neu
   max_definition_level: 5
   max_repetition_level: 1
   physical_type: DOUBLE
   logical_type: None
   converted_type (legacy): NONE
   
   ############ Column(pos) ############
   name: pos
   path: entities.bag.array_element.sentiment.pos
   max_definition_level: 5
   max_repetition_level: 1
   physical_type: DOUBLE
   logical_type: None
   converted_type (legacy): NONE
   
   ############ Column(signal_type) ############
   name: signal_type
   path: entities.bag.array_element.signal_type
   max_definition_level: 4
   max_repetition_level: 1
   physical_type: BYTE_ARRAY
   logical_type: String
   converted_type (legacy): UTF8
   
   ############ Column(surface_form) ############
   name: surface_form
   path: entities.bag.array_element.surface_form
   max_definition_level: 4
   max_repetition_level: 1
   physical_type: BYTE_ARRAY
   logical_type: String
   converted_type (legacy): UTF8
   
   ############ Column(wiki_title) ############
   name: wiki_title
   path: entities.bag.array_element.wiki_title
   max_definition_level: 4
   max_repetition_level: 1
   physical_type: BYTE_ARRAY
   logical_type: String
   converted_type (legacy): UTF8
   
   ############ Column(salience) ############
   name: salience
   path: entity_salience.bag.array_element.salience
   max_definition_level: 4
   max_repetition_level: 1
   physical_type: DOUBLE
   logical_type: None
   converted_type (legacy): NONE
   
   ############ Column(salience_rank) ############
   name: salience_rank
   path: entity_salience.bag.array_element.salience_rank
   max_definition_level: 4
   max_repetition_level: 1
   physical_type: INT64
   logical_type: None
   converted_type (legacy): NONE
   
   ############ Column(wiki_title) ############
   name: wiki_title
   path: entity_salience.bag.array_element.wiki_title
   max_definition_level: 4
   max_repetition_level: 1
   physical_type: BYTE_ARRAY
   logical_type: String
   converted_type (legacy): UTF8
   
   ############ Column(feed) ############
   name: feed
   path: feed
   max_definition_level: 1
   max_repetition_level: 0
   physical_type: BYTE_ARRAY
   logical_type: String
   converted_type (legacy): UTF8
   
   ############ Column(format_version) ############
   name: format_version
   path: format_version
   max_definition_level: 1
   max_repetition_level: 0
   physical_type: INT64
   logical_type: None
   converted_type (legacy): NONE
   
   ############ Column(ingestion_id) ############
   name: ingestion_id
   path: ingestion_id
   max_definition_level: 1
   max_repetition_level: 0
   physical_type: BYTE_ARRAY
   logical_type: String
   converted_type (legacy): UTF8
   
   ############ Column(replay_time) ############
   name: replay_time
   path: ingestion_metadata.replay_time
   max_definition_level: 2
   max_repetition_level: 0
   physical_type: BYTE_ARRAY
   logical_type: String
   converted_type (legacy): UTF8
   
   ############ Column(replay_type) ############
   name: replay_type
   path: ingestion_metadata.replay_type
   max_definition_level: 2
   max_repetition_level: 0
   physical_type: BYTE_ARRAY
   logical_type: String
   converted_type (legacy): UTF8
   
   ############ Column(journalist_id) ############
   name: journalist_id
   path: journalist_id
   max_definition_level: 1
   max_repetition_level: 0
   physical_type: BYTE_ARRAY
   logical_type: String
   converted_type (legacy): UTF8
   
   ############ Column(journalist_name) ############
   name: journalist_name
   path: journalist_name
   max_definition_level: 1
   max_repetition_level: 0
   physical_type: BYTE_ARRAY
   logical_type: String
   converted_type (legacy): UTF8
   
   ############ Column(journalistic_quality) ############
   name: journalistic_quality
   path: journalistic_quality
   max_definition_level: 1
   max_repetition_level: 0
   physical_type: DOUBLE
   logical_type: None
   converted_type (legacy): NONE
   
   ############ Column(language) ############
   name: language
   path: language
   max_definition_level: 1
   max_repetition_level: 0
   physical_type: BYTE_ARRAY
   logical_type: String
   converted_type (legacy): UTF8
   
   ############ Column(licence_id) ############
   name: licence_id
   path: licence_id
   max_definition_level: 1
   max_repetition_level: 0
   physical_type: BYTE_ARRAY
   logical_type: String
   converted_type (legacy): UTF8
   
   ############ Column(media_type) ############
   name: media_type
   path: media_type
   max_definition_level: 1
   max_repetition_level: 0
   physical_type: BYTE_ARRAY
   logical_type: String
   converted_type (legacy): UTF8
   
   ############ Column(array_element) ############
   name: array_element
   path: metadata_keys.bag.array_element
   max_definition_level: 3
   max_repetition_level: 1
   physical_type: BYTE_ARRAY
   logical_type: String
   converted_type (legacy): UTF8
   
   ############ Column(content) ############
   name: content
   path: native_content.content
   max_definition_level: 2
   max_repetition_level: 0
   physical_type: BYTE_ARRAY
   logical_type: String
   converted_type (legacy): UTF8
   
   ############ Column(summary) ############
   name: summary
   path: native_content.summary
   max_definition_level: 2
   max_repetition_level: 0
   physical_type: BYTE_ARRAY
   logical_type: String
   converted_type (legacy): UTF8
   
   ############ Column(title) ############
   name: title
   path: native_content.title
   max_definition_level: 2
   max_repetition_level: 0
   physical_type: BYTE_ARRAY
   logical_type: String
   converted_type (legacy): UTF8
   
   ############ Column(old_article_ids) ############
   name: old_article_ids
   path: old_article_ids
   max_definition_level: 1
   max_repetition_level: 0
   physical_type: BYTE_ARRAY
   logical_type: String
   converted_type (legacy): UTF8
   
   ############ Column(original_id) ############
   name: original_id
   path: original_id
   max_definition_level: 1
   max_repetition_level: 0
   physical_type: BYTE_ARRAY
   logical_type: String
   converted_type (legacy): UTF8
   
   ############ Column(original_source_id) ############
   name: original_source_id
   path: original_source_id
   max_definition_level: 1
   max_repetition_level: 0
   physical_type: BYTE_ARRAY
   logical_type: String
   converted_type (legacy): UTF8
   
   ############ Column(original_source_name) ############
   name: original_source_name
   path: original_source_name
   max_definition_level: 1
   max_repetition_level: 0
   physical_type: BYTE_ARRAY
   logical_type: String
   converted_type (legacy): UTF8
   
   ############ Column(original_url) ############
   name: original_url
   path: original_url
   max_definition_level: 1
   max_repetition_level: 0
   physical_type: BYTE_ARRAY
   logical_type: String
   converted_type (legacy): UTF8
   
   ############ Column(page) ############
   name: page
   path: page
   max_definition_level: 1
   max_repetition_level: 0
   physical_type: INT64
   logical_type: None
   converted_type (legacy): NONE
   
   ############ Column(page_section) ############
   name: page_section
   path: page_section
   max_definition_level: 1
   max_repetition_level: 0
   physical_type: BYTE_ARRAY
   logical_type: String
   converted_type (legacy): UTF8
   
   ############ Column(pdf_url) ############
   name: pdf_url
   path: pdf_url
   max_definition_level: 1
   max_repetition_level: 0
   physical_type: BYTE_ARRAY
   logical_type: String
   converted_type (legacy): UTF8
   
   ############ Column(processed_date) ############
   name: processed_date
   path: processed_date
   max_definition_level: 1
   max_repetition_level: 0
   physical_type: BYTE_ARRAY
   logical_type: String
   converted_type (legacy): UTF8
   
   ############ Column(podcast_link) ############
   name: podcast_link
   path: provider_data.podcast_link
   max_definition_level: 2
   max_repetition_level: 0
   physical_type: BYTE_ARRAY
   logical_type: String
   converted_type (legacy): UTF8
   
   ############ Column(copyright) ############
   name: copyright
   path: provider_data.copyright
   max_definition_level: 2
   max_repetition_level: 0
   physical_type: BYTE_ARRAY
   logical_type: String
   converted_type (legacy): UTF8
   
   ############ Column(formatted_content) ############
   name: formatted_content
   path: provider_data.formatted_content
   max_definition_level: 2
   max_repetition_level: 0
   physical_type: BYTE_ARRAY
   logical_type: String
   converted_type (legacy): UTF8
   
   ############ Column(nla_publisher) ############
   name: nla_publisher
   path: provider_data.nla_publisher
   max_definition_level: 2
   max_repetition_level: 0
   physical_type: BYTE_ARRAY
   logical_type: String
   converted_type (legacy): UTF8
   
   ############ Column(provider_hosted_url) ############
   name: provider_hosted_url
   path: provider_data.provider_hosted_url
   max_definition_level: 2
   max_repetition_level: 0
   physical_type: BYTE_ARRAY
   logical_type: String
   converted_type (legacy): UTF8
   
   ############ Column(publication_time) ############
   name: publication_time
   path: provider_data.publication_time
   max_definition_level: 2
   max_repetition_level: 0
   physical_type: BYTE_ARRAY
   logical_type: String
   converted_type (legacy): UTF8
   
   ############ Column(end_date) ############
   name: end_date
   path: provider_data.tvplayer.end_date
   max_definition_level: 3
   max_repetition_level: 0
   physical_type: BYTE_ARRAY
   logical_type: String
   converted_type (legacy): UTF8
   
   ############ Column(start_date) ############
   name: start_date
   path: provider_data.tvplayer.start_date
   max_definition_level: 3
   max_repetition_level: 0
   physical_type: BYTE_ARRAY
   logical_type: String
   converted_type (legacy): UTF8
   
   ############ Column(station_id) ############
   name: station_id
   path: provider_data.tvplayer.station_id
   max_definition_level: 3
   max_repetition_level: 0
   physical_type: BYTE_ARRAY
   logical_type: String
   converted_type (legacy): UTF8
   
   ############ Column(asset_id) ############
   name: asset_id
   path: provider_data.tvplayer.asset_id
   max_definition_level: 3
   max_repetition_level: 0
   physical_type: BYTE_ARRAY
   logical_type: String
   converted_type (legacy): UTF8
   
   ############ Column(partner_id) ############
   name: partner_id
   path: provider_data.tvplayer.partner_id
   max_definition_level: 3
   max_repetition_level: 0
   physical_type: INT64
   logical_type: None
   converted_type (legacy): NONE
   
   ############ Column(published_date) ############
   name: published_date
   path: published_date
   max_definition_level: 1
   max_repetition_level: 0
   physical_type: BYTE_ARRAY
   logical_type: String
   converted_type (legacy): UTF8
   
   ############ Column(end) ############
   name: end
   path: quotes.bag.array_element.end
   max_definition_level: 4
   max_repetition_level: 1
   physical_type: INT64
   logical_type: None
   converted_type (legacy): NONE
   
   ############ Column(start) ############
   name: start
   path: quotes.bag.array_element.start
   max_definition_level: 4
   max_repetition_level: 1
   physical_type: INT64
   logical_type: None
   converted_type (legacy): NONE
   
   ############ Column(text) ############
   name: text
   path: quotes.bag.array_element.text
   max_definition_level: 4
   max_repetition_level: 1
   physical_type: BYTE_ARRAY
   logical_type: String
   converted_type (legacy): UTF8
   
   ############ Column(received) ############
   name: received
   path: received
   max_definition_level: 1
   max_repetition_level: 0
   physical_type: BYTE_ARRAY
   logical_type: String
   converted_type (legacy): UTF8
   
   ############ Column(signal_importance) ############
   name: signal_importance
   path: signal_importance
   max_definition_level: 1
   max_repetition_level: 0
   physical_type: INT64
   logical_type: None
   converted_type (legacy): NONE
   
   ############ Column(shares) ############
   name: shares
   path: social_engagement.twitter.shares
   max_definition_level: 3
   max_repetition_level: 0
   physical_type: INT64
   logical_type: None
   converted_type (legacy): NONE
   
   ############ Column(array_element) ############
   name: array_element
   path: source_groups.bag.array_element
   max_definition_level: 3
   max_repetition_level: 1
   physical_type: BYTE_ARRAY
   logical_type: String
   converted_type (legacy): UTF8
   
   ############ Column(story_cursor) ############
   name: story_cursor
   path: story_cursor
   max_definition_level: 1
   max_repetition_level: 0
   physical_type: BYTE_ARRAY
   logical_type: String
   converted_type (legacy): UTF8
   
   ############ Column(story_id) ############
   name: story_id
   path: story_id
   max_definition_level: 1
   max_repetition_level: 0
   physical_type: BYTE_ARRAY
   logical_type: String
   converted_type (legacy): UTF8
   
   ############ Column(story_index) ############
   name: story_index
   path: story_index
   max_definition_level: 1
   max_repetition_level: 0
   physical_type: BYTE_ARRAY
   logical_type: String
   converted_type (legacy): UTF8
   
   ############ Column(story_processed_date) ############
   name: story_processed_date
   path: story_processed_date
   max_definition_level: 1
   max_repetition_level: 0
   physical_type: BYTE_ARRAY
   logical_type: String
   converted_type (legacy): UTF8
   
   ############ Column(story_published_date) ############
   name: story_published_date
   path: story_published_date
   max_definition_level: 1
   max_repetition_level: 0
   physical_type: BYTE_ARRAY
   logical_type: String
   converted_type (legacy): UTF8
   
   ############ Column(summary) ############
   name: summary
   path: summary
   max_definition_level: 1
   max_repetition_level: 0
   physical_type: BYTE_ARRAY
   logical_type: String
   converted_type (legacy): UTF8
   
   ############ Column(summary_origin) ############
   name: summary_origin
   path: summary_origin
   max_definition_level: 1
   max_repetition_level: 0
   physical_type: BYTE_ARRAY
   logical_type: String
   converted_type (legacy): UTF8
   
   ############ Column(id) ############
   name: id
   path: taxonomy_categories.bag.array_element.id
   max_definition_level: 4
   max_repetition_level: 1
   physical_type: BYTE_ARRAY
   logical_type: String
   converted_type (legacy): UTF8
   
   ############ Column(score) ############
   name: score
   path: taxonomy_categories.bag.array_element.score
   max_definition_level: 4
   max_repetition_level: 1
   physical_type: DOUBLE
   logical_type: None
   converted_type (legacy): NONE
   
   ############ Column(title) ############
   name: title
   path: title
   max_definition_level: 1
   max_repetition_level: 0
   physical_type: BYTE_ARRAY
   logical_type: String
   converted_type (legacy): UTF8
   
   ############ Column(probability) ############
   name: probability
   path: topic_predictions.bag.array_element.probability
   max_definition_level: 4
   max_repetition_level: 1
   physical_type: DOUBLE
   logical_type: None
   converted_type (legacy): NONE
   
   ############ Column(topic_id) ############
   name: topic_id
   path: topic_predictions.bag.array_element.topic_id
   max_definition_level: 4
   max_repetition_level: 1
   physical_type: BYTE_ARRAY
   logical_type: String
   converted_type (legacy): UTF8
   
   ############ Column(array_element) ############
   name: array_element
   path: topics.bag.array_element
   max_definition_level: 3
   max_repetition_level: 1
   physical_type: BYTE_ARRAY
   logical_type: String
   converted_type (legacy): UTF8
   
   ############ Column(tracking_url) ############
   name: tracking_url
   path: tracking_url
   max_definition_level: 1
   max_repetition_level: 0
   physical_type: BYTE_ARRAY
   logical_type: String
   converted_type (legacy): UTF8
   
   ############ Column(translated_from) ############
   name: translated_from
   path: translated_from
   max_definition_level: 1
   max_repetition_level: 0
   physical_type: BYTE_ARRAY
   logical_type: String
   converted_type (legacy): UTF8
   
   ############ Column(_hoodie_is_deleted) ############
   name: _hoodie_is_deleted
   path: _hoodie_is_deleted
   max_definition_level: 1
   max_repetition_level: 0
   physical_type: BOOLEAN
   logical_type: None
   converted_type (legacy): NONE
   ```
   
   Driver logs leading up to the stack trace:
   
   ```
   21/11/19 12:04:04 INFO SqlFileBasedTransformer: SQL Query for transformation 
: 
   21/11/19 12:04:04 INFO SqlFileBasedTransformer: SELECT
       *,
       story_published_date AS story_published_partition_date
   FROM HOODIE_SRC_TMP_TABLE_c39559a4_ae03_4665_8e56_af7b5f22495e
   21/11/19 12:04:04 INFO HoodieTableMetaClient: Loading HoodieTableMetaClient 
from s3://<bucket>/articles_hudi_copy_on_write/
   21/11/19 12:04:04 INFO HoodieTableConfig: Loading table properties from 
s3://<bucket>/articles_hudi_copy_on_write/.hoodie/hoodie.properties
   21/11/19 12:04:04 INFO S3NativeFileSystem: Opening 
's3://<bucket>/articles_hudi_copy_on_write/.hoodie/hoodie.properties' for 
reading
   21/11/19 12:04:04 INFO HoodieTableMetaClient: Finished Loading Table of type 
COPY_ON_WRITE(version=1, baseFileFormat=PARQUET) from 
s3://<bucket>/articles_hudi_copy_on_write/
   21/11/19 12:04:04 INFO HoodieActiveTimeline: Loaded instants upto : 
Option{val=[20211119105001__clean__COMPLETED]}
   21/11/19 12:04:04 INFO S3NativeFileSystem: Opening 
's3://<bucket>/articles_hudi_copy_on_write/.hoodie/20211119104902.commit' for 
reading
   21/11/19 12:04:04 INFO FileSourceStrategy: Pushed Filters: 
   21/11/19 12:04:04 INFO FileSourceStrategy: Post-Scan Filters: 
   21/11/19 12:04:04 INFO FileSourceStrategy: Output Data Schema: struct<id: 
string, version: bigint, aggregation_id: string, area_on_page: bigint, 
article_cursor: string ... 59 more fields>
   21/11/19 12:04:05 INFO CodeGenerator: Code generated in 232.874305 ms
   21/11/19 12:04:05 INFO MemoryStore: Block broadcast_1 stored as values in 
memory (estimated size 436.0 KiB, free 3.4 GiB)
   21/11/19 12:04:05 INFO MemoryStore: Block broadcast_1_piece0 stored as bytes 
in memory (estimated size 44.3 KiB, free 3.4 GiB)
   21/11/19 12:04:05 INFO BlockManagerInfo: Added broadcast_1_piece0 in memory 
on ip-10-0-72-160.eu-west-1.compute.internal:45183 (size: 44.3 KiB, free: 3.4 
GiB)
   21/11/19 12:04:05 INFO SparkContext: Created broadcast 1 from toRdd at 
HoodieSparkUtils.scala:133
   21/11/19 12:04:05 INFO FileSourceScanExec: Planning scan with bin packing, 
max size: 4194304 bytes, open cost is considered as scanning 4194304 bytes, 
number of split files: 2, prefetch: false
   21/11/19 12:04:05 INFO FileSourceScanExec: relation: None, 
fileSplitsInPartitionHistogram: Vector((1 fileSplits,2))
   21/11/19 12:04:05 INFO SparkContext: Starting job: isEmpty at 
DeltaSync.java:437
   21/11/19 12:04:05 INFO DAGScheduler: Got job 1 (isEmpty at 
DeltaSync.java:437) with 1 output partitions
   21/11/19 12:04:05 INFO DAGScheduler: Final stage: ResultStage 1 (isEmpty at 
DeltaSync.java:437)
   21/11/19 12:04:05 INFO DAGScheduler: Parents of final stage: List()
   21/11/19 12:04:05 INFO DAGScheduler: Missing parents: List()
   21/11/19 12:04:05 INFO DAGScheduler: Submitting ResultStage 1 
(MapPartitionsRDD[7] at mapPartitions at HoodieSparkUtils.scala:134), which has 
no missing parents
   21/11/19 12:04:05 INFO MemoryStore: Block broadcast_2 stored as values in 
memory (estimated size 252.9 KiB, free 3.4 GiB)
   21/11/19 12:04:05 INFO MemoryStore: Block broadcast_2_piece0 stored as bytes 
in memory (estimated size 54.8 KiB, free 3.4 GiB)
   21/11/19 12:04:05 INFO BlockManagerInfo: Added broadcast_2_piece0 in memory 
on ip-10-0-72-160.eu-west-1.compute.internal:45183 (size: 54.8 KiB, free: 3.4 
GiB)
   21/11/19 12:04:05 INFO SparkContext: Created broadcast 2 from broadcast at 
DAGScheduler.scala:1484
   21/11/19 12:04:05 INFO DAGScheduler: Submitting 1 missing tasks from 
ResultStage 1 (MapPartitionsRDD[7] at mapPartitions at 
HoodieSparkUtils.scala:134) (first 15 tasks are for partitions Vector(0))
   21/11/19 12:04:05 INFO YarnClusterScheduler: Adding task set 1.0 with 1 
tasks resource profile 0
   21/11/19 12:04:05 INFO FairSchedulableBuilder: Added task set TaskSet_1.0 
tasks to pool default
   21/11/19 12:04:05 INFO TaskSetManager: Starting task 0.0 in stage 1.0 (TID 
1) (ip-10-0-72-160.eu-west-1.compute.internal, executor 6, partition 0, 
RACK_LOCAL, 5244 bytes) taskResourceAssignments Map()
   21/11/19 12:04:05 INFO BlockManagerInfo: Added broadcast_2_piece0 in memory 
on ip-10-0-72-160.eu-west-1.compute.internal:42321 (size: 54.8 KiB, free: 3.4 
GiB)
   21/11/19 12:04:06 INFO BlockManagerInfo: Added broadcast_1_piece0 in memory 
on ip-10-0-72-160.eu-west-1.compute.internal:42321 (size: 44.3 KiB, free: 3.4 
GiB)
   21/11/19 12:04:08 WARN TaskSetManager: Lost task 0.0 in stage 1.0 (TID 1) 
(ip-10-0-72-160.eu-west-1.compute.internal executor 6): 
java.lang.ClassCastException: java.lang.String cannot be cast to 
org.apache.spark.sql.Row
        at 
org.apache.hudi.AvroConversionHelper$.$anonfun$createConverterToAvro$15(AvroConversionHelper.scala:358)
        at 
org.apache.hudi.AvroConversionHelper$.$anonfun$createConverterToAvro$15(AvroConversionHelper.scala:362)
        at 
org.apache.hudi.HoodieSparkUtils$.$anonfun$createRddInternal$3(HoodieSparkUtils.scala:138)
        at scala.collection.Iterator$$anon$10.next(Iterator.scala:459)
        at scala.collection.Iterator$SliceIterator.next(Iterator.scala:271)
        at scala.collection.Iterator.foreach(Iterator.scala:941)
        at scala.collection.Iterator.foreach$(Iterator.scala:941)
        at scala.collection.AbstractIterator.foreach(Iterator.scala:1429)
        at scala.collection.generic.Growable.$plus$plus$eq(Growable.scala:62)
        at scala.collection.generic.Growable.$plus$plus$eq$(Growable.scala:53)
        at 
scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:105)
        at 
scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:49)
        at scala.collection.TraversableOnce.to(TraversableOnce.scala:315)
        at scala.collection.TraversableOnce.to$(TraversableOnce.scala:313)
        at scala.collection.AbstractIterator.to(Iterator.scala:1429)
        at scala.collection.TraversableOnce.toBuffer(TraversableOnce.scala:307)
        at scala.collection.TraversableOnce.toBuffer$(TraversableOnce.scala:307)
        at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1429)
        at scala.collection.TraversableOnce.toArray(TraversableOnce.scala:294)
        at scala.collection.TraversableOnce.toArray$(TraversableOnce.scala:288)
        at scala.collection.AbstractIterator.toArray(Iterator.scala:1429)
        at org.apache.spark.rdd.RDD.$anonfun$take$2(RDD.scala:1449)
        at 
org.apache.spark.SparkContext.$anonfun$runJob$5(SparkContext.scala:2281)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
        at org.apache.spark.scheduler.Task.run(Task.scala:131)
        at 
org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:497)
        at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1439)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:500)
        at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
   ```
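   The stack trace suggests the Avro converter was generated from the evolved table schema while the incoming batch still has the old shape, so a per-field converter that expects a nested `Row` receives a plain `String`. A minimal sketch of that failure mode (all names here are hypothetical, not Hudi internals; `Object[]` stands in for `org.apache.spark.sql.Row`):
   
   ```java
   public class ConverterMismatch {
       interface Converter { Object apply(Object item); }
   
       public static void main(String[] args) {
           // Converter built for a struct-typed field in the (evolved) target
           // schema: it blindly casts the incoming value to the nested type.
           Converter structConverter = item -> ((Object[]) item)[0];
   
           try {
               // The incoming batch still carries a plain String for this
               // field, so the cast inside the converter blows up.
               structConverter.apply("plain-string-value");
           } catch (ClassCastException e) {
               System.out.println("caught: " + e.getClass().getSimpleName());
           }
       }
   }
   ```
   
   In other words, the cast itself is unconditional: whenever the converter chain and the actual record layout disagree on a field's type, the first mismatched field throws exactly this kind of `ClassCastException`.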


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

