Souldiv opened a new issue, #13057:
URL: https://github.com/apache/hudi/issues/13057

   **Describe the problem you faced**
   
   I am trying to store table metadata in the Hive metastore using the spark-submit command below. I have followed the configuration shown [here](https://hudi.apache.org/docs/0.15.0/configurations/#META_SYNC), and I run the following command:
   
   ```bash
   spark-submit --class org.apache.hudi.utilities.streamer.HoodieStreamer $HUDI_UTILITIES_BUNDLE \
   --table-type COPY_ON_WRITE \
   --source-class org.apache.hudi.utilities.sources.JsonKafkaSource \
   --source-ordering-field ts \
   --target-base-path hdfs://localhost:9000/user/hive/warehouse/stock_ticks_cow_2 \
   --target-table stock_ticks_cow_2 \
   --schemaprovider-class org.apache.hudi.utilities.schema.SchemaRegistryProvider \
   --hoodie-conf hoodie.streamer.schemaprovider.registry.url=http://localhost:8081/subjects/stock_ticks-value/versions/latest \
   --hoodie-conf hoodie.streamer.source.kafka.topic=stock_ticks \
   --hoodie-conf hoodie.datasource.write.recordkey.field=key \
   --hoodie-conf hoodie.datasource.write.partitionpath.field=date \
   --hoodie-conf schema.registry.url=http://localhost:8081 \
   --hoodie-conf auto.offset.reset=earliest \
   --hoodie-conf bootstrap.servers=localhost:9092 \
   --hoodie-conf hoodie.upsert.shuffle.parallelism=2 \
   --hoodie-conf hoodie.insert.shuffle.parallelism=2 \
   --hoodie-conf hoodie.delete.shuffle.parallelism=2 \
   --hoodie-conf hoodie.bulkinsert.shuffle.parallelism=2 \
   --hoodie-conf hoodie.datasource.hive_sync.mode=hms \
   --hoodie-conf hoodie.datasource.hive_sync.enable=true \
   --hoodie-conf hoodie.datasource.hive_sync.metastore.uris=thrift://localhost:9083 \
   --hoodie-conf hoodie.datasource.hive_sync.table=stock_ticks_cow_2 \
   --hoodie-conf hoodie.datasource.meta.sync.enable=true \
   --hoodie-conf hoodie.datasource.hive_sync.batch_num=10 \
   --props file:///dev/null
   ```
   
   Spark writes the table to HDFS as intended, but I don't see the table metadata in Hive through beeline. Please let me know if I am missing a required configuration or if I have misunderstood the purpose of these settings.
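
   In case it helps triage: my understanding (which may be wrong) is that the `hive_sync` configs alone may not be enough from the streamer, and that HoodieStreamer also exposes a streamer-level sync switch. A sketch of the same submit with that flag added (flag name assumed from the 0.15.0 utilities bundle's help output; everything else unchanged from the command above):

   ```shell
   # Same spark-submit as above, abbreviated; only the added line differs.
   spark-submit --class org.apache.hudi.utilities.streamer.HoodieStreamer $HUDI_UTILITIES_BUNDLE \
   --table-type COPY_ON_WRITE \
   --enable-sync \
   --props file:///dev/null
   ```

   I have not confirmed whether this changes the behavior.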
   
   **To Reproduce**
   
   Steps to reproduce the behavior:
   
   1. Push stock data to the `stock_ticks` topic.
   2. Run the spark-submit command above.
   3. Check from beeline whether the table shows up using `show tables;`.
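
   One thing I have not ruled out (an assumption on my side): the HiveServer2 that beeline connects to must point at the same metastore the sync wrote to, otherwise `show tables;` would look at a different (e.g. embedded Derby) metastore. The relevant `hive-site.xml` fragment on the HiveServer2 side would be:

   ```xml
   <!-- hive-site.xml for HiveServer2/beeline; must match the
        hoodie.datasource.hive_sync.metastore.uris passed to the streamer. -->
   <property>
     <name>hive.metastore.uris</name>
     <value>thrift://localhost:9083</value>
   </property>
   ```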
   
   **Expected behavior**
   
   I was expecting the table metadata to be synced to Hive after running the spark-submit command with the Hive sync configuration.
   
   **Environment Description**
   
   * Hudi version : 0.15
   
   * Spark version : 3.5.5
   
   * Hive version : 2.3.9
   
   * Hadoop version : 3.4.1
   
   * Storage (HDFS/S3/GCS..) : HDFS
   
   * Running on Docker? (yes/no) : No
   
   
   **Stacktrace**
   
   
   ```
   25/03/30 17:42:33 WARN Utils: Your hostname, hudi resolves to a loopback 
address: 127.0.1.1; using 10.0.0.108 instead (on interface eth0)
   25/03/30 17:42:33 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to 
another address
   25/03/30 17:42:33 WARN SchedulerConfGenerator: Job Scheduling Configs will 
not be in effect as spark.scheduler.mode is not set to FAIR at instantiation 
time. Continuing without scheduling configs
   25/03/30 17:42:34 INFO SparkContext: Running Spark version 3.5.5
   25/03/30 17:42:34 INFO SparkContext: OS info Linux, 6.8.4-3-pve, amd64
   25/03/30 17:42:34 INFO SparkContext: Java version 1.8.0_442
   25/03/30 17:42:34 WARN NativeCodeLoader: Unable to load native-hadoop 
library for your platform... using builtin-java classes where applicable
   25/03/30 17:42:34 INFO ResourceUtils: 
==============================================================
   25/03/30 17:42:34 INFO ResourceUtils: No custom resources configured for 
spark.driver.
   25/03/30 17:42:34 INFO ResourceUtils: 
==============================================================
   25/03/30 17:42:34 INFO SparkContext: Submitted application: 
streamer-stock_ticks_cow_2
   25/03/30 17:42:34 INFO ResourceProfile: Default ResourceProfile created, 
executor resources: Map(cores -> name: cores, amount: 1, script: , vendor: , 
memory -> name: memory, amount: 1024, script: , vendor: , offHeap -> name: 
offHeap, amount: 0, script: , vendor: ), task resources: Map(cpus -> name: 
cpus, amount: 1.0)
   25/03/30 17:42:34 INFO ResourceProfile: Limiting resource is cpu
   25/03/30 17:42:34 INFO ResourceProfileManager: Added ResourceProfile id: 0
   25/03/30 17:42:34 INFO SecurityManager: Changing view acls to: conuser
   25/03/30 17:42:34 INFO SecurityManager: Changing modify acls to: conuser
   25/03/30 17:42:34 INFO SecurityManager: Changing view acls groups to: 
   25/03/30 17:42:34 INFO SecurityManager: Changing modify acls groups to: 
   25/03/30 17:42:34 INFO SecurityManager: SecurityManager: authentication 
disabled; ui acls disabled; users with view permissions: conuser; groups with 
view permissions: EMPTY; users with modify permissions: conuser; groups with 
modify permissions: EMPTY
   25/03/30 17:42:34 INFO deprecation: mapred.output.compression.codec is 
deprecated. Instead, use mapreduce.output.fileoutputformat.compress.codec
   25/03/30 17:42:34 INFO deprecation: mapred.output.compress is deprecated. 
Instead, use mapreduce.output.fileoutputformat.compress
   25/03/30 17:42:34 INFO deprecation: mapred.output.compression.type is 
deprecated. Instead, use mapreduce.output.fileoutputformat.compress.type
   25/03/30 17:42:34 INFO Utils: Successfully started service 'sparkDriver' on 
port 44127.
   25/03/30 17:42:34 INFO SparkEnv: Registering MapOutputTracker
   25/03/30 17:42:34 INFO SparkEnv: Registering BlockManagerMaster
   25/03/30 17:42:34 INFO BlockManagerMasterEndpoint: Using 
org.apache.spark.storage.DefaultTopologyMapper for getting topology information
   25/03/30 17:42:34 INFO BlockManagerMasterEndpoint: 
BlockManagerMasterEndpoint up
   25/03/30 17:42:34 INFO SparkEnv: Registering BlockManagerMasterHeartbeat
   25/03/30 17:42:34 INFO DiskBlockManager: Created local directory at 
/tmp/blockmgr-970f83dc-4465-4290-a3dd-b6a401ed3feb
   25/03/30 17:42:34 INFO MemoryStore: MemoryStore started with capacity 366.3 
MiB
   25/03/30 17:42:34 INFO SparkEnv: Registering OutputCommitCoordinator
   25/03/30 17:42:34 INFO JettyUtils: Start Jetty 0.0.0.0:8090 for SparkUI
   25/03/30 17:42:34 WARN Utils: Service 'SparkUI' could not bind on port 8090. 
Attempting port 8091.
   25/03/30 17:42:34 INFO Utils: Successfully started service 'SparkUI' on port 
8091.
   25/03/30 17:42:34 INFO SparkContext: Added JAR 
file:/home/conuser/downloads/hudi-0.15.0/packaging/hudi-utilities-bundle/target/hudi-utilities-bundle_2.12-0.15.0.jar
 at spark://10.0.0.108:44127/jars/hudi-utilities-bundle_2.12-0.15.0.jar with 
timestamp 1743356554014
   25/03/30 17:42:34 INFO Executor: Starting executor ID driver on host 
10.0.0.108
   25/03/30 17:42:34 INFO Executor: OS info Linux, 6.8.4-3-pve, amd64
   25/03/30 17:42:34 INFO Executor: Java version 1.8.0_442
   25/03/30 17:42:34 INFO Executor: Starting executor with user classpath 
(userClassPathFirst = false): ''
   25/03/30 17:42:34 INFO Executor: Created or updated repl class loader 
org.apache.spark.util.MutableURLClassLoader@365a6a43 for default.
   25/03/30 17:42:34 INFO Executor: Fetching 
spark://10.0.0.108:44127/jars/hudi-utilities-bundle_2.12-0.15.0.jar with 
timestamp 1743356554014
   25/03/30 17:42:34 INFO TransportClientFactory: Successfully created 
connection to /10.0.0.108:44127 after 19 ms (0 ms spent in bootstraps)
   25/03/30 17:42:34 INFO Utils: Fetching 
spark://10.0.0.108:44127/jars/hudi-utilities-bundle_2.12-0.15.0.jar to 
/tmp/spark-8b36c157-3895-45ce-86b2-5a063c272795/userFiles-2caada7f-5b56-4053-8db1-5b00562db47c/fetchFileTemp821209291924917814.tmp
   25/03/30 17:42:34 INFO Executor: Adding 
file:/tmp/spark-8b36c157-3895-45ce-86b2-5a063c272795/userFiles-2caada7f-5b56-4053-8db1-5b00562db47c/hudi-utilities-bundle_2.12-0.15.0.jar
 to class loader default
   25/03/30 17:42:34 INFO Utils: Successfully started service 
'org.apache.spark.network.netty.NettyBlockTransferService' on port 35865.
   25/03/30 17:42:34 INFO NettyBlockTransferService: Server created on 
10.0.0.108:35865
   25/03/30 17:42:34 INFO BlockManager: Using 
org.apache.spark.storage.RandomBlockReplicationPolicy for block replication 
policy
   25/03/30 17:42:34 INFO BlockManagerMaster: Registering BlockManager 
BlockManagerId(driver, 10.0.0.108, 35865, None)
   25/03/30 17:42:34 INFO BlockManagerMasterEndpoint: Registering block manager 
10.0.0.108:35865 with 366.3 MiB RAM, BlockManagerId(driver, 10.0.0.108, 35865, 
None)
   25/03/30 17:42:34 INFO BlockManagerMaster: Registered BlockManager 
BlockManagerId(driver, 10.0.0.108, 35865, None)
   25/03/30 17:42:34 INFO BlockManager: Initialized BlockManager: 
BlockManagerId(driver, 10.0.0.108, 35865, None)
   25/03/30 17:42:35 WARN DFSPropertiesConfiguration: Cannot find 
HUDI_CONF_DIR, please set it as the dir of hudi-defaults.conf
   25/03/30 17:42:35 WARN DFSPropertiesConfiguration: Properties file 
file:/etc/hudi/conf/hudi-defaults.conf not found. Ignoring to load props file
   25/03/30 17:42:35 INFO UtilHelpers: Adding overridden properties to file 
properties.
   25/03/30 17:42:35 INFO SharedState: spark.sql.warehouse.dir is not set, but 
hive.metastore.warehouse.dir is set. Setting spark.sql.warehouse.dir to the 
value of hive.metastore.warehouse.dir.
   25/03/30 17:42:35 INFO SharedState: Warehouse path is 
'hdfs://localhost:9000/user/hive/warehouse'.
   25/03/30 17:42:35 INFO HoodieTableMetaClient: Loading HoodieTableMetaClient 
from hdfs://localhost:9000/user/hive/warehouse/stock_ticks_cow_2
   25/03/30 17:42:35 INFO HoodieTableConfig: Loading table properties from 
hdfs://localhost:9000/user/hive/warehouse/stock_ticks_cow_2/.hoodie/hoodie.properties
   25/03/30 17:42:35 INFO HoodieTableMetaClient: Finished Loading Table of type 
COPY_ON_WRITE(version=1, baseFileFormat=PARQUET) from 
hdfs://localhost:9000/user/hive/warehouse/stock_ticks_cow_2
   25/03/30 17:42:35 INFO HoodieStreamer: Creating Hudi Streamer with configs:
   auto.offset.reset: earliest
   bootstrap.servers: localhost:9092
   hoodie.auto.adjust.lock.configs: true
   hoodie.bulkinsert.shuffle.parallelism: 2
   hoodie.datasource.hive_sync.batch_num: 10
   hoodie.datasource.hive_sync.enable: true
   hoodie.datasource.hive_sync.metastore.uris: thrift://localhost:9083
   hoodie.datasource.hive_sync.mode: hms
   hoodie.datasource.hive_sync.table: stock_ticks_cow_2
   hoodie.datasource.meta.sync.enable: true
   hoodie.datasource.write.partitionpath.field: date
   hoodie.datasource.write.reconcile.schema: false
   hoodie.datasource.write.recordkey.field: key
   hoodie.delete.shuffle.parallelism: 2
   hoodie.insert.shuffle.parallelism: 2
   hoodie.streamer.schemaprovider.registry.url: 
http://localhost:8081/subjects/stock_ticks-value/versions/latest
   hoodie.streamer.source.kafka.topic: stock_ticks
   hoodie.upsert.shuffle.parallelism: 2
   schema.registry.url: http://localhost:8081
   
   25/03/30 17:42:35 INFO HoodieSparkKeyGeneratorFactory: The value of 
hoodie.datasource.write.keygenerator.type is empty; inferred to be SIMPLE
   25/03/30 17:42:35 INFO HoodieTableMetaClient: Loading HoodieTableMetaClient 
from hdfs://localhost:9000/user/hive/warehouse/stock_ticks_cow_2
   25/03/30 17:42:35 INFO HoodieTableConfig: Loading table properties from 
hdfs://localhost:9000/user/hive/warehouse/stock_ticks_cow_2/.hoodie/hoodie.properties
   25/03/30 17:42:35 INFO HoodieTableMetaClient: Finished Loading Table of type 
COPY_ON_WRITE(version=1, baseFileFormat=PARQUET) from 
hdfs://localhost:9000/user/hive/warehouse/stock_ticks_cow_2
   25/03/30 17:42:35 INFO HoodieActiveTimeline: Loaded instants upto : 
Option{val=[20250330173718165__commit__COMPLETED__20250330173723152]}
   25/03/30 17:42:35 INFO HoodieIngestionService: Ingestion service starts 
running in run-once mode
   25/03/30 17:42:35 INFO HoodieTableMetaClient: Loading HoodieTableMetaClient 
from hdfs://localhost:9000/user/hive/warehouse/stock_ticks_cow_2
   25/03/30 17:42:35 INFO HoodieTableConfig: Loading table properties from 
hdfs://localhost:9000/user/hive/warehouse/stock_ticks_cow_2/.hoodie/hoodie.properties
   25/03/30 17:42:35 INFO HoodieTableMetaClient: Finished Loading Table of type 
COPY_ON_WRITE(version=1, baseFileFormat=PARQUET) from 
hdfs://localhost:9000/user/hive/warehouse/stock_ticks_cow_2
   25/03/30 17:42:35 INFO HoodieActiveTimeline: Loaded instants upto : 
Option{val=[20250330173718165__commit__COMPLETED__20250330173723152]}
   25/03/30 17:42:35 INFO HoodieTableMetaClient: Loading HoodieTableMetaClient 
from hdfs://localhost:9000/user/hive/warehouse/stock_ticks_cow_2
   25/03/30 17:42:35 INFO HoodieTableConfig: Loading table properties from 
hdfs://localhost:9000/user/hive/warehouse/stock_ticks_cow_2/.hoodie/hoodie.properties
   25/03/30 17:42:35 INFO HoodieTableMetaClient: Finished Loading Table of type 
COPY_ON_WRITE(version=1, baseFileFormat=PARQUET) from 
hdfs://localhost:9000/user/hive/warehouse/stock_ticks_cow_2
   25/03/30 17:42:36 INFO StreamSync: Checkpoint to resume from : 
Option{val=stock_ticks,0:3482}
   25/03/30 17:42:36 INFO KafkaOffsetGen: SourceLimit not configured, set 
numEvents to default value : 5000000
   25/03/30 17:42:36 INFO KafkaOffsetGen: getNextOffsetRanges set config 
hoodie.streamer.source.kafka.minPartitions to 0
   25/03/30 17:42:36 INFO ConsumerConfig: ConsumerConfig values: 
        allow.auto.create.topics = true
        auto.commit.interval.ms = 5000
        auto.offset.reset = earliest
        bootstrap.servers = [localhost:9092]
        check.crcs = true
        client.dns.lookup = use_all_dns_ips
        client.id = consumer-null-1
        client.rack = 
        connections.max.idle.ms = 540000
        default.api.timeout.ms = 60000
        enable.auto.commit = true
        exclude.internal.topics = true
        fetch.max.bytes = 52428800
        fetch.max.wait.ms = 500
        fetch.min.bytes = 1
        group.id = null
        group.instance.id = null
        heartbeat.interval.ms = 3000
        interceptor.classes = []
        internal.leave.group.on.close = true
        internal.throw.on.fetch.stable.offset.unsupported = false
        isolation.level = read_uncommitted
        key.deserializer = class 
org.apache.kafka.common.serialization.StringDeserializer
        max.partition.fetch.bytes = 1048576
        max.poll.interval.ms = 300000
        max.poll.records = 500
        metadata.max.age.ms = 300000
        metric.reporters = []
        metrics.num.samples = 2
        metrics.recording.level = INFO
        metrics.sample.window.ms = 30000
        partition.assignment.strategy = [class 
org.apache.kafka.clients.consumer.RangeAssignor]
        receive.buffer.bytes = 65536
        reconnect.backoff.max.ms = 1000
        reconnect.backoff.ms = 50
        request.timeout.ms = 30000
        retry.backoff.ms = 100
        sasl.client.callback.handler.class = null
        sasl.jaas.config = null
        sasl.kerberos.kinit.cmd = /usr/bin/kinit
        sasl.kerberos.min.time.before.relogin = 60000
        sasl.kerberos.service.name = null
        sasl.kerberos.ticket.renew.jitter = 0.05
        sasl.kerberos.ticket.renew.window.factor = 0.8
        sasl.login.callback.handler.class = null
        sasl.login.class = null
        sasl.login.refresh.buffer.seconds = 300
        sasl.login.refresh.min.period.seconds = 60
        sasl.login.refresh.window.factor = 0.8
        sasl.login.refresh.window.jitter = 0.05
        sasl.mechanism = GSSAPI
        security.protocol = PLAINTEXT
        security.providers = null
        send.buffer.bytes = 131072
        session.timeout.ms = 10000
        socket.connection.setup.timeout.max.ms = 30000
        socket.connection.setup.timeout.ms = 10000
        ssl.cipher.suites = null
        ssl.enabled.protocols = [TLSv1.2]
        ssl.endpoint.identification.algorithm = https
        ssl.engine.factory.class = null
        ssl.key.password = null
        ssl.keymanager.algorithm = SunX509
        ssl.keystore.certificate.chain = null
        ssl.keystore.key = null
        ssl.keystore.location = null
        ssl.keystore.password = null
        ssl.keystore.type = JKS
        ssl.protocol = TLSv1.2
        ssl.provider = null
        ssl.secure.random.implementation = null
        ssl.trustmanager.algorithm = PKIX
        ssl.truststore.certificates = null
        ssl.truststore.location = null
        ssl.truststore.password = null
        ssl.truststore.type = JKS
        value.deserializer = class 
org.apache.kafka.common.serialization.StringDeserializer
   
   25/03/30 17:42:36 WARN ConsumerConfig: The configuration 
'schema.registry.url' was supplied but isn't a known config.
   25/03/30 17:42:36 INFO AppInfoParser: Kafka version: 2.8.0
   25/03/30 17:42:36 INFO AppInfoParser: Kafka commitId: ebb1d6e21cc92130
   25/03/30 17:42:36 INFO AppInfoParser: Kafka startTimeMs: 1743356556089
   25/03/30 17:42:36 INFO Metadata: [Consumer clientId=consumer-null-1, 
groupId=null] Cluster ID: Nk-xOeixRZGj41miDeXdjQ
   25/03/30 17:42:36 INFO Metrics: Metrics scheduler closed
   25/03/30 17:42:36 INFO Metrics: Closing reporter 
org.apache.kafka.common.metrics.JmxReporter
   25/03/30 17:42:36 INFO Metrics: Metrics reporters closed
   25/03/30 17:42:36 INFO AppInfoParser: App info kafka.consumer for 
consumer-null-1 unregistered
   25/03/30 17:42:36 INFO KafkaOffsetGen: final ranges [OffsetRange(topic: 
'stock_ticks', partition: 0, range: [3482 -> 3482])]
   25/03/30 17:42:36 INFO KafkaSource: About to read sourceLimit 
9223372036854775807 in 0 spark partitions from kafka for topic stock_ticks with 
offset ranges [OffsetRange(topic: 'stock_ticks', partition: 0, range: [3482 -> 
3482])]
   25/03/30 17:42:36 INFO KafkaSource: About to read 0 from Kafka for topic 
:stock_ticks
   25/03/30 17:42:36 INFO HoodieActiveTimeline: Loaded instants upto : 
Option{val=[20250330173718165__commit__COMPLETED__20250330173723152]}
   25/03/30 17:42:36 INFO UtilHelpers: Adding overridden properties to file 
properties.
   25/03/30 17:42:36 INFO StreamSync: No new data, source checkpoint has not 
changed. Nothing to commit. Old checkpoint=(Option{val=stock_ticks,0:3482}). 
New Checkpoint=(stock_ticks,0:3482)
   25/03/30 17:42:36 INFO StreamSync: Shutting down embedded timeline server
   25/03/30 17:42:36 INFO HoodieIngestionService: Ingestion service (run-once 
mode) has been shut down.
   25/03/30 17:42:36 INFO SparkContext: SparkContext is stopping with exitCode 
0.
   25/03/30 17:42:36 INFO SparkUI: Stopped Spark web UI at 
http://10.0.0.108:8091
   25/03/30 17:42:36 INFO MapOutputTrackerMasterEndpoint: 
MapOutputTrackerMasterEndpoint stopped!
   25/03/30 17:42:36 INFO MemoryStore: MemoryStore cleared
   25/03/30 17:42:36 INFO BlockManager: BlockManager stopped
   25/03/30 17:42:36 INFO BlockManagerMaster: BlockManagerMaster stopped
   25/03/30 17:42:36 INFO 
OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: 
OutputCommitCoordinator stopped!
   25/03/30 17:42:36 INFO SparkContext: Successfully stopped SparkContext
   25/03/30 17:42:36 INFO ShutdownHookManager: Shutdown hook called
   25/03/30 17:42:36 INFO ShutdownHookManager: Deleting directory 
/tmp/spark-37076236-cc75-4ba3-a7bc-65a0778326a0
   25/03/30 17:42:36 INFO ShutdownHookManager: Deleting directory 
/tmp/spark-8b36c157-3895-45ce-86b2-5a063c272795
   ```
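
   Reading the log again: this particular run ended with "No new data, source checkpoint has not changed. Nothing to commit.", and as far as I understand, meta sync only runs as part of a successful commit, so a no-op run would never reach the sync step. The checkpoint string in the log has the form `topic,partition:offset`; a small illustrative parser (my own hypothetical helper, not Hudi code) showing why the old and new checkpoints compare equal here:

   ```python
   def parse_checkpoint(cp: str) -> dict:
       # "stock_ticks,0:3482" -> {"topic": "stock_ticks", "offsets": {0: 3482}}
       topic, rest = cp.split(",", 1)
       offsets = {}
       for entry in rest.split(","):
           partition, offset = entry.split(":")
           offsets[int(partition)] = int(offset)
       return {"topic": topic, "offsets": offsets}

   # Old checkpoint (resumed) and new checkpoint (computed) from the log above.
   old = parse_checkpoint("stock_ticks,0:3482")
   new = parse_checkpoint("stock_ticks,0:3482")
   print(old == new)  # True -> nothing to commit, so sync is skipped
   ```

   So perhaps the first commit (which did write data) failed to sync silently, and this re-run simply had nothing to do.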
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
