maheshguptags commented on issue #10609:
URL: https://github.com/apache/hudi/issues/10609#issuecomment-1931327718
@ad1happy2go As discussed, I tried the Hudi DeltaStreamer, but unfortunately I could not run it due to heap-space issues, even without sending any data.
**Command**
```
spark/bin/spark-submit \
--name customer-event-hudideltaStream \
--num-executors 10 \
--executor-memory 3g \
--driver-memory 6g \
--conf spark.task.cpus=1 \
--conf spark.driver.extraJavaOptions="-XX:+PrintGCApplicationStoppedTime -XX:+PrintGCApplicationConcurrentTime -XX:+PrintGCTimeStamps -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/tmp/varadarb_ds_driver.hprof" \
--conf spark.executor.extraJavaOptions="-XX:+PrintGCApplicationStoppedTime -XX:+PrintGCApplicationConcurrentTime -XX:+PrintGCTimeStamps -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/tmp/varadarb_ds_executor.hprof" \
--packages org.apache.hadoop:hadoop-aws:3.3.4 \
--jars /home/mahesh.gupta/hudi-utilities-bundle_2.12-0.14.0.jar,/opt/kafka_2.13-2.8.1/aws-msk-iam-auth-1.1.9-all.jar \
--class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer \
/home/mahesh.gupta/hudi-utilities-slim-bundle_2.12-0.14.0.jar \
--checkpoint s3a://cdp-offline-xxx/checkpointing/eks/sparkhudipoc/hudistream_rli_3 \
--target-base-path s3a://cdp-offline-xxx/huditream_rli_3 \
--target-table customer_profile \
--table-type COPY_ON_WRITE \
--base-file-format PARQUET \
--props /home/mahesh.gupta/hoodie.properties \
--source-class org.apache.hudi.utilities.sources.JsonKafkaSource \
--source-ordering-field updated_date \
--payload-class org.apache.hudi.common.model.DefaultHoodieRecordPayload \
--schemaprovider-class org.apache.hudi.utilities.schema.FilebasedSchemaProvider \
--hoodie-conf hoodie.streamer.schemaprovider.source.schema.file=/home/mahesh.gupta/source.avsc \
--hoodie-conf hoodie.streamer.schemaprovider.target.schema.file=/home/mahesh.gupta/source.avsc \
--op UPSERT \
--hoodie-conf hoodie.streamer.source.kafka.topic=spark_hudi_temp \
--hoodie-conf hoodie.datasource.write.partitionpath.field=client_id \
--continuous
```
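One thing that stands out in the log below is `KafkaOffsetGen: SourceLimit not configured, set numEvents to default value : 5000000`, i.e. each sync is allowed to plan up to 5M Kafka events, which can put pressure on the driver. A minimal sketch of capping this with DeltaStreamer's `--source-limit` option (the `500000` value is only an illustration, not a tuned recommendation; all other options stay as above):

```
spark/bin/spark-submit \
  ... same options as above ... \
  --source-limit 500000 \
  --continuous
```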
**Log output for the same run**
```
auto.offset.reset: latest
bootstrap.servers: local:9092
hoodie.auto.adjust.lock.configs: true
hoodie.clean.async: true
hoodie.clean.automatic: true
hoodie.clean.max.commits: 6
hoodie.clean.trigger.strategy: NUM_COMMITS
hoodie.cleaner.commits.retained: 4
hoodie.cleaner.parallelism: 50
hoodie.datasource.write.partitionpath.field: client_id
hoodie.datasource.write.precombine.field: updated_date
hoodie.datasource.write.reconcile.schema: false
hoodie.datasource.write.recordkey.field: customer_id,client_id
hoodie.index.type: RECORD_INDEX
hoodie.metadata.record.index.enable: true
hoodie.metadata.record.index.max.filegroup.count: 5000
hoodie.metadata.record.index.min.filegroup.count: 20
hoodie.parquet.compression.codec: snappy
hoodie.streamer.schemaprovider.source.schema.file:
/home/mahesh.gupta/source.avsc
hoodie.streamer.schemaprovider.target.schema.file:
/home/mahesh.gupta/source.avsc
hoodie.streamer.source.kafka.topic: spark_hudi_temp
sasl.client.callback.handler.class: SENSITIVE_INFO_MASKED
sasl.jaas.config: SENSITIVE_INFO_MASKED
sasl.mechanism: SENSITIVE_INFO_MASKED
security.protocol: SASL_SSL
24/02/06 07:12:20 INFO FSUtils: Resolving file
/home/mahesh.gupta/source.avsc to be a remote file.
24/02/06 07:12:20 INFO HoodieSparkKeyGeneratorFactory: The value of
hoodie.datasource.write.keygenerator.type is empty; inferred to be COMPLEX
24/02/06 07:12:20 INFO HoodieTableMetaClient: Loading HoodieTableMetaClient
from s3a://cdp-offline-xxx/huditream_rli_3
24/02/06 07:12:20 INFO HoodieTableConfig: Loading table properties from
s3a://cdp-offline-xxx/huditream_rli_3/.hoodie/hoodie.properties
24/02/06 07:12:20 INFO HoodieTableMetaClient: Finished Loading Table of type
COPY_ON_WRITE(version=1, baseFileFormat=PARQUET) from
s3a://cdp-offline-xxx/huditream_rli_3
24/02/06 07:12:20 INFO HoodieActiveTimeline: Loaded instants upto :
Option{val=[20240205111704768__commit__COMPLETED__20240205111748000]}
24/02/06 07:12:21 INFO HoodieWriteConfig: Automatically set
hoodie.write.lock.provider=org.apache.hudi.client.transaction.lock.InProcessLockProvider
since user has not set the lock provider for single writer with async table
services
24/02/06 07:12:21 INFO HoodieWriteConfig: Automatically set
hoodie.write.lock.provider=org.apache.hudi.client.transaction.lock.InProcessLockProvider
since user has not set the lock provider for single writer with async table
services
24/02/06 07:12:21 INFO HoodieIngestionService: Ingestion service starts
running in continuous mode
24/02/06 07:12:21 INFO HoodieTableMetaClient: Loading HoodieTableMetaClient
from s3a://cdp-offline-xxx/huditream_rli_3
24/02/06 07:12:21 INFO HoodieTableConfig: Loading table properties from
s3a://cdp-offline-xxxdev/huditream_rli_3/.hoodie/hoodie.properties
24/02/06 07:12:21 INFO HoodieTableMetaClient: Finished Loading Table of type
COPY_ON_WRITE(version=1, baseFileFormat=PARQUET) from
s3a://cdp-offline-xxx/huditream_rli_3
24/02/06 07:12:21 INFO HoodieActiveTimeline: Loaded instants upto :
Option{val=[20240205111704768__commit__COMPLETED__20240205111748000]}
24/02/06 07:12:21 INFO StreamSync: Checkpoint to resume from :
Option{val=spark_hudi_temp,0:732979,1:727818,2:725765,3:719464,4:721968,5:727487,6:737757,7:727566,8:736890,9:722032,10:723030,11:724587,12:723768,13:732789,14:721004,15:721541,16:734303,17:717704,18:734645,19:721914}
24/02/06 07:12:21 INFO ConsumerConfig: ConsumerConfig values:
allow.auto.create.topics = true
auto.commit.interval.ms = 5000
auto.offset.reset = latest
bootstrap.servers = local:9092
check.crcs = true
client.dns.lookup = use_all_dns_ips
client.id = consumer-null-1
client.rack =
connections.max.idle.ms = 540000
default.api.timeout.ms = 60000
enable.auto.commit = true
exclude.internal.topics = true
fetch.max.bytes = 52428800
fetch.max.wait.ms = 500
fetch.min.bytes = 1
group.id = null
group.instance.id = null
heartbeat.interval.ms = 3000
interceptor.classes = []
internal.leave.group.on.close = true
internal.throw.on.fetch.stable.offset.unsupported = false
isolation.level = read_uncommitted
key.deserializer = class
org.apache.kafka.common.serialization.StringDeserializer
max.partition.fetch.bytes = 1048576
max.poll.interval.ms = 300000
max.poll.records = 500
metadata.max.age.ms = 300000
metric.reporters = []
metrics.num.samples = 2
metrics.recording.level = INFO
metrics.sample.window.ms = 30000
partition.assignment.strategy = [class
org.apache.kafka.clients.consumer.RangeAssignor]
receive.buffer.bytes = 65536
reconnect.backoff.max.ms = 1000
reconnect.backoff.ms = 50
request.timeout.ms = 30000
retry.backoff.ms = 100
sasl.client.callback.handler.class = class
software.amazon.msk.auth.iam.IAMClientCallbackHandler
sasl.jaas.config = [hidden]
sasl.kerberos.kinit.cmd = /usr/bin/kinit
sasl.kerberos.min.time.before.relogin = 60000
sasl.kerberos.service.name = null
sasl.kerberos.ticket.renew.jitter = 0.05
sasl.kerberos.ticket.renew.window.factor = 0.8
sasl.login.callback.handler.class = null
sasl.login.class = null
sasl.login.refresh.buffer.seconds = 300
sasl.login.refresh.min.period.seconds = 60
sasl.login.refresh.window.factor = 0.8
sasl.login.refresh.window.jitter = 0.05
sasl.mechanism = AWS_MSK_IAM
security.protocol = SASL_SSL
security.providers = null
send.buffer.bytes = 131072
session.timeout.ms = 10000
socket.connection.setup.timeout.max.ms = 30000
socket.connection.setup.timeout.ms = 10000
ssl.cipher.suites = null
ssl.enabled.protocols = [TLSv1.2, TLSv1.3]
ssl.endpoint.identification.algorithm = https
ssl.engine.factory.class = null
ssl.key.password = null
ssl.keymanager.algorithm = SunX509
ssl.keystore.certificate.chain = null
ssl.keystore.key = null
ssl.keystore.location = null
ssl.keystore.password = null
ssl.keystore.type = JKS
ssl.protocol = TLSv1.3
ssl.provider = null
ssl.secure.random.implementation = null
ssl.trustmanager.algorithm = PKIX
ssl.truststore.certificates = null
ssl.truststore.location = null
ssl.truststore.password = null
ssl.truststore.type = JKS
value.deserializer = class
org.apache.kafka.common.serialization.StringDeserializer
24/02/06 07:12:22 INFO AbstractLogin: Successfully logged in.
24/02/06 07:12:22 INFO AppInfoParser: Kafka version: 2.8.0
24/02/06 07:12:22 INFO AppInfoParser: Kafka commitId: ebb1d6e21cc92130
24/02/06 07:12:22 INFO AppInfoParser: Kafka startTimeMs: 1707203542234
24/02/06 07:12:23 INFO Metadata: [Consumer clientId=consumer-null-1,
groupId=null] Cluster ID: H-NPJc0UTZ6XH3XCAnEDOw
24/02/06 07:12:25 INFO Metrics: Metrics scheduler closed
24/02/06 07:12:25 INFO Metrics: Closing reporter
org.apache.kafka.common.metrics.JmxReporter
24/02/06 07:12:25 INFO Metrics: Metrics reporters closed
24/02/06 07:12:25 INFO AppInfoParser: App info kafka.consumer for
consumer-null-1 unregistered
24/02/06 07:12:25 INFO KafkaOffsetGen: SourceLimit not configured, set
numEvents to default value : 5000000
24/02/06 07:12:25 INFO KafkaOffsetGen: getNextOffsetRanges set config
hoodie.streamer.source.kafka.minPartitions to 0
Killed
```
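For what it's worth, a plain `Killed` with no Java stacktrace usually means the OS (or the container runtime, e.g. a Kubernetes pod limit on EKS) OOM-killed the whole JVM process; in that case `-XX:+HeapDumpOnOutOfMemoryError` never fires and no `.hprof` file is written, because total process memory (heap + off-heap + overhead) exceeded the node/pod limit rather than `-Xmx`. A quick check (assuming shell access on the driver host; these are generic Linux commands, not Hudi-specific):

```
# Did the kernel OOM-kill the process? (no .hprof is written in that case)
dmesg | grep -iE 'killed process|out of memory'

# If an .hprof file *was* produced, it was a JVM-level OutOfMemoryError;
# open it with a heap analyzer such as Eclipse MAT or VisualVM, e.g.:
# /tmp/varadarb_ds_driver.hprof
```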