maheshguptags commented on issue #10609:
URL: https://github.com/apache/hudi/issues/10609#issuecomment-1931327718

   @ad1happy2go as discussed, I tried the Hudi DeltaStreamer, but unfortunately I could not get it to run: the job dies with heap-space issues even before any data is sent.
   
   **Command** 
   ``` 
   spark/bin/spark-submit \
   --name customer-event-hudideltaStream \
   --num-executors 10 \
   --executor-memory 3g \
   --driver-memory 6g \
   --conf spark.task.cpus=1 \
   --conf spark.driver.extraJavaOptions="-XX:+PrintGCApplicationStoppedTime -XX:+PrintGCApplicationConcurrentTime -XX:+PrintGCTimeStamps -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/tmp/varadarb_ds_driver.hprof" \
   --conf spark.executor.extraJavaOptions="-XX:+PrintGCApplicationStoppedTime -XX:+PrintGCApplicationConcurrentTime -XX:+PrintGCTimeStamps -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/tmp/varadarb_ds_executor.hprof" \
   --packages org.apache.hadoop:hadoop-aws:3.3.4 \
   --jars /home/mahesh.gupta/hudi-utilities-bundle_2.12-0.14.0.jar,/opt/kafka_2.13-2.8.1/aws-msk-iam-auth-1.1.9-all.jar \
   --class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer /home/mahesh.gupta/hudi-utilities-slim-bundle_2.12-0.14.0.jar \
   --checkpoint s3a://cdp-offline-xxx/checkpointing/eks/sparkhudipoc/hudistream_rli_3 \
   --target-base-path s3a://cdp-offline-xxx/huditream_rli_3 \
   --target-table customer_profile \
   --table-type COPY_ON_WRITE \
   --base-file-format PARQUET \
   --props /home/mahesh.gupta/hoodie.properties \
   --source-class org.apache.hudi.utilities.sources.JsonKafkaSource \
   --source-ordering-field updated_date \
   --payload-class org.apache.hudi.common.model.DefaultHoodieRecordPayload \
   --schemaprovider-class org.apache.hudi.utilities.schema.FilebasedSchemaProvider \
   --hoodie-conf hoodie.streamer.schemaprovider.source.schema.file=/home/mahesh.gupta/source.avsc \
   --hoodie-conf hoodie.streamer.schemaprovider.target.schema.file=/home/mahesh.gupta/source.avsc \
   --op UPSERT \
   --hoodie-conf hoodie.streamer.source.kafka.topic=spark_hudi_temp \
   --hoodie-conf hoodie.datasource.write.partitionpath.field=client_id \
   --continuous 
   ```
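
   For reference, a sketch of what I plan to try next to rule out batch-size pressure (illustrative, untuned values; `--source-limit` is the DeltaStreamer flag that caps how much is read from the source per batch, and the memoryOverhead confs add non-heap headroom to each JVM):

   ```shell
   # Illustrative additions to the spark-submit command above (values untuned).
   # memoryOverhead gives each JVM non-heap headroom on top of -Xmx:
   #   --conf spark.driver.memoryOverhead=1g
   #   --conf spark.executor.memoryOverhead=1g
   # Cap the data read from Kafka per batch, instead of the 5,000,000-event
   # default that KafkaOffsetGen reports in the log below:
   #   --source-limit 500000
   ```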
   
   **Log output** (no stacktrace is printed; the process is simply killed)
   ```
   auto.offset.reset: latest
   bootstrap.servers: local:9092
   hoodie.auto.adjust.lock.configs: true
   hoodie.clean.async: true
   hoodie.clean.automatic: true
   hoodie.clean.max.commits: 6
   hoodie.clean.trigger.strategy: NUM_COMMITS
   hoodie.cleaner.commits.retained: 4
   hoodie.cleaner.parallelism: 50
   hoodie.datasource.write.partitionpath.field: client_id
   hoodie.datasource.write.precombine.field: updated_date
   hoodie.datasource.write.reconcile.schema: false
   hoodie.datasource.write.recordkey.field: customer_id,client_id
   hoodie.index.type: RECORD_INDEX
   hoodie.metadata.record.index.enable: true
   hoodie.metadata.record.index.max.filegroup.count: 5000
   hoodie.metadata.record.index.min.filegroup.count: 20
   hoodie.parquet.compression.codec: snappy
   hoodie.streamer.schemaprovider.source.schema.file: /home/mahesh.gupta/source.avsc
   hoodie.streamer.schemaprovider.target.schema.file: /home/mahesh.gupta/source.avsc
   hoodie.streamer.source.kafka.topic: spark_hudi_temp
   sasl.client.callback.handler.class: SENSITIVE_INFO_MASKED
   sasl.jaas.config: SENSITIVE_INFO_MASKED
   sasl.mechanism: SENSITIVE_INFO_MASKED
   security.protocol: SASL_SSL
   
   24/02/06 07:12:20 INFO FSUtils: Resolving file /home/mahesh.gupta/source.avscto be a remote file.
   24/02/06 07:12:20 INFO HoodieSparkKeyGeneratorFactory: The value of hoodie.datasource.write.keygenerator.type is empty; inferred to be COMPLEX
   24/02/06 07:12:20 INFO HoodieTableMetaClient: Loading HoodieTableMetaClient from s3a://cdp-offline-xxx/huditream_rli_3
   24/02/06 07:12:20 INFO HoodieTableConfig: Loading table properties from s3a://cdp-offline-xxx/huditream_rli_3/.hoodie/hoodie.properties
   24/02/06 07:12:20 INFO HoodieTableMetaClient: Finished Loading Table of type COPY_ON_WRITE(version=1, baseFileFormat=PARQUET) from s3a://cdp-offline-xxx/huditream_rli_3
   24/02/06 07:12:20 INFO HoodieActiveTimeline: Loaded instants upto : Option{val=[20240205111704768__commit__COMPLETED__20240205111748000]}
   24/02/06 07:12:21 INFO HoodieWriteConfig: Automatically set hoodie.write.lock.provider=org.apache.hudi.client.transaction.lock.InProcessLockProvider since user has not set the lock provider for single writer with async table services
   24/02/06 07:12:21 INFO HoodieWriteConfig: Automatically set hoodie.write.lock.provider=org.apache.hudi.client.transaction.lock.InProcessLockProvider since user has not set the lock provider for single writer with async table services
   24/02/06 07:12:21 INFO HoodieIngestionService: Ingestion service starts running in continuous mode
   24/02/06 07:12:21 INFO HoodieTableMetaClient: Loading HoodieTableMetaClient from s3a://cdp-offline-xxx/huditream_rli_3
   24/02/06 07:12:21 INFO HoodieTableConfig: Loading table properties from s3a://cdp-offline-xxxdev/huditream_rli_3/.hoodie/hoodie.properties
   24/02/06 07:12:21 INFO HoodieTableMetaClient: Finished Loading Table of type COPY_ON_WRITE(version=1, baseFileFormat=PARQUET) from s3a://cdp-offline-xxx/huditream_rli_3
   24/02/06 07:12:21 INFO HoodieActiveTimeline: Loaded instants upto : Option{val=[20240205111704768__commit__COMPLETED__20240205111748000]}
   24/02/06 07:12:21 INFO StreamSync: Checkpoint to resume from : Option{val=spark_hudi_temp,0:732979,1:727818,2:725765,3:719464,4:721968,5:727487,6:737757,7:727566,8:736890,9:722032,10:723030,11:724587,12:723768,13:732789,14:721004,15:721541,16:734303,17:717704,18:734645,19:721914}
   24/02/06 07:12:21 INFO ConsumerConfig: ConsumerConfig values: 
        allow.auto.create.topics = true
        auto.commit.interval.ms = 5000
        auto.offset.reset = latest
        bootstrap.servers = local:9092
        check.crcs = true
        client.dns.lookup = use_all_dns_ips
        client.id = consumer-null-1
        client.rack = 
        connections.max.idle.ms = 540000
        default.api.timeout.ms = 60000
        enable.auto.commit = true
        exclude.internal.topics = true
        fetch.max.bytes = 52428800
        fetch.max.wait.ms = 500
        fetch.min.bytes = 1
        group.id = null
        group.instance.id = null
        heartbeat.interval.ms = 3000
        interceptor.classes = []
        internal.leave.group.on.close = true
        internal.throw.on.fetch.stable.offset.unsupported = false
        isolation.level = read_uncommitted
        key.deserializer = class org.apache.kafka.common.serialization.StringDeserializer
        max.partition.fetch.bytes = 1048576
        max.poll.interval.ms = 300000
        max.poll.records = 500
        metadata.max.age.ms = 300000
        metric.reporters = []
        metrics.num.samples = 2
        metrics.recording.level = INFO
        metrics.sample.window.ms = 30000
        partition.assignment.strategy = [class org.apache.kafka.clients.consumer.RangeAssignor]
        receive.buffer.bytes = 65536
        reconnect.backoff.max.ms = 1000
        reconnect.backoff.ms = 50
        request.timeout.ms = 30000
        retry.backoff.ms = 100
        sasl.client.callback.handler.class = class software.amazon.msk.auth.iam.IAMClientCallbackHandler
        sasl.jaas.config = [hidden]
        sasl.kerberos.kinit.cmd = /usr/bin/kinit
        sasl.kerberos.min.time.before.relogin = 60000
        sasl.kerberos.service.name = null
        sasl.kerberos.ticket.renew.jitter = 0.05
        sasl.kerberos.ticket.renew.window.factor = 0.8
        sasl.login.callback.handler.class = null
        sasl.login.class = null
        sasl.login.refresh.buffer.seconds = 300
        sasl.login.refresh.min.period.seconds = 60
        sasl.login.refresh.window.factor = 0.8
        sasl.login.refresh.window.jitter = 0.05
        sasl.mechanism = AWS_MSK_IAM
        security.protocol = SASL_SSL
        security.providers = null
        send.buffer.bytes = 131072
        session.timeout.ms = 10000
        socket.connection.setup.timeout.max.ms = 30000
        socket.connection.setup.timeout.ms = 10000
        ssl.cipher.suites = null
        ssl.enabled.protocols = [TLSv1.2, TLSv1.3]
        ssl.endpoint.identification.algorithm = https
        ssl.engine.factory.class = null
        ssl.key.password = null
        ssl.keymanager.algorithm = SunX509
        ssl.keystore.certificate.chain = null
        ssl.keystore.key = null
        ssl.keystore.location = null
        ssl.keystore.password = null
        ssl.keystore.type = JKS
        ssl.protocol = TLSv1.3
        ssl.provider = null
        ssl.secure.random.implementation = null
        ssl.trustmanager.algorithm = PKIX
        ssl.truststore.certificates = null
        ssl.truststore.location = null
        ssl.truststore.password = null
        ssl.truststore.type = JKS
        value.deserializer = class org.apache.kafka.common.serialization.StringDeserializer
   
   24/02/06 07:12:22 INFO AbstractLogin: Successfully logged in.
   24/02/06 07:12:22 INFO AppInfoParser: Kafka version: 2.8.0
   24/02/06 07:12:22 INFO AppInfoParser: Kafka commitId: ebb1d6e21cc92130
   24/02/06 07:12:22 INFO AppInfoParser: Kafka startTimeMs: 1707203542234
   24/02/06 07:12:23 INFO Metadata: [Consumer clientId=consumer-null-1, groupId=null] Cluster ID: H-NPJc0UTZ6XH3XCAnEDOw
   24/02/06 07:12:25 INFO Metrics: Metrics scheduler closed
   24/02/06 07:12:25 INFO Metrics: Closing reporter org.apache.kafka.common.metrics.JmxReporter
   24/02/06 07:12:25 INFO Metrics: Metrics reporters closed
   24/02/06 07:12:25 INFO AppInfoParser: App info kafka.consumer for consumer-null-1 unregistered
   24/02/06 07:12:25 INFO KafkaOffsetGen: SourceLimit not configured, set numEvents to default value : 5000000
   24/02/06 07:12:25 INFO KafkaOffsetGen: getNextOffsetRanges set config hoodie.streamer.source.kafka.minPartitions to 0
   
   Killed
   ```


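   The job ends with a bare `Killed` rather than a Java OutOfMemoryError, so my suspicion is that the OS (or container runtime) OOM killer terminated the JVM before the `-XX:+HeapDumpOnOutOfMemoryError` handler could fire. A quick way to confirm, assuming shell access to the host that ran spark-submit:

   ```shell
   # Look for kernel OOM-killer activity around the time of the failure
   # (dmesg may need elevated privileges on some hosts).
   dmesg -T 2>/dev/null | grep -iE 'killed process|out of memory' | tail -n 20 || true

   # If the JVM itself threw OutOfMemoryError instead, the heap-dump path
   # configured in the driver extraJavaOptions above should exist:
   ls -lh /tmp/varadarb_ds_driver.hprof 2>/dev/null || echo "no driver heap dump found"
   ```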