rhyphenkumar opened a new issue, #9477:
URL: https://github.com/apache/seatunnel/issues/9477

   I am trying to run a simple streaming job with the Spark engine which reads 
from Kafka and writes files to S3. What I see is that the job gets stuck and 
never moves forward, and no files get written. I pushed data into Kafka after 
starting the job, and I see that the current offset for the consumer group 
moves forward to the latest offsets on the Kafka server, but the data never 
gets written by SeaTunnel. The logs also stay at the same point forever.
   The SeaTunnel documentation for Spark streaming mode seems to be very 
sparse or missing. I didn't find a single example of how to run SeaTunnel with 
the Spark engine for streaming, so even getting the Kafka read working took a 
lot of trial and error. Following is my config:
   
   SeaTunnel Version
   2.3.11
   
   SeaTunnel Config
   env {
     parallelism = 2
     job.mode = "STREAMING" // Change to BATCH for testing
     checkpoint.interval = 2000
     checkpoint.path = "file:///Users/b0279627/Downloads/kafka_streaming"
     job.name = "SeaTunnel-spark-streaming2"
     spark.executor.instances = 1
     spark.executor.cores = 1
     spark.executor.memory = "1g"
     spark.master = local
     spark.eventLog.enabled = "true"
     spark.eventLog.dir = "file:///Users/b0279627/Downloads/spark/"
   }
   source {
     Kafka {
       schema = {
         fields {
           name = "string"
           age = "int"
         }
       }
       topic = "topic_streaming"
       bootstrap.servers = "localhost:9092"
       consumer.group = "seatunnel_batch_new4" // Use a unique consumer group

       # Important: For Spark compatibility with "latest" mode
       start_mode = "group_offsets"

       # Spark Kafka specific configurations
       kafka.config {
         "auto.offset.reset" = "earliest"
         "enable.auto.commit" = "true"
         "max.poll.records" = "1000"
         "session.timeout.ms" = "60000"
         "heartbeat.interval.ms" = "15000"
         "fetch.max.wait.ms" = "5000"
         "auto.commit.interval.ms" = "50"
       }
     }
   }
   
   sink {
     S3File {
       parallelism = 18
       bucket = "s3a://instance1-bucket"
       path = "/datalake_staging/seatunnel_test/spark_2/kafka_streaming"
       fs.s3a.endpoint = "http://localhost:9500"
       fs.s3a.aws.credentials.provider = "org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider"
       file_format_type = "parquet"
       access_key = "sea_tunnnel"
       secret_key = "sea_tunnel"

       # Enhanced Hadoop S3 configuration for Spark
       hadoop_s3_properties {
         "fs.s3a.path.style.access" = "true"
         "fs.s3a.impl" = "org.apache.hadoop.fs.s3a.S3AFileSystem"
         "fs.s3a.connection.ssl.enabled" = "false"
         "fs.s3a.fast.upload" = "true"
         "fs.s3a.connection.maximum" = "100"
         "fs.s3a.attempts.maximum" = "20"
         "fs.s3a.connection.timeout" = "300000"
         "fs.s3a.multipart.size" = "5242880"
       }

       # Important for Spark to correctly handle the data
       save_mode = "append"
     }
   }
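
   To isolate whether the Kafka source emits any records at all under Spark in 
STREAMING mode, one debugging step is to swap the S3File sink for SeaTunnel's 
standard Console sink, keeping the env and source blocks above unchanged. This 
is only a diagnostic sketch, not a known fix: if rows appear in the job log, 
the stall is on the S3File side; if not, the source/engine combination is the 
problem.

   ```hocon
   # Diagnostic sketch: replace the S3File sink above with the Console sink.
   # Assumes the same env and Kafka source blocks as in the config above.
   sink {
     Console {
       # Prints every row received from the source to the job log,
       # removing S3/Hadoop configuration from the equation entirely.
     }
   }
   ```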
   
   Running Command
   ./bin/start-seatunnel-spark-3-connector-v2.sh \
     --master yarn --deploy-mode client --config ./config/
   
   Is Spark streaming supported in SeaTunnel? If there are any example 
configs, please point me to them.
   Also, if I want to know which sink connectors are supported with Spark 
Structured Streaming in SeaTunnel, how can I find that out? I didn't find this 
information anywhere in the SeaTunnel documentation or code.
   What is the behaviour of SeaTunnel if the job mode is set to streaming but 
the source connector doesn't support streaming? Will it throw an error or run 
the job in batch mode? I tried with the SFTP file source, and it never ran in 
streaming mode; by default it went into batch mode.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
