rhyphenkumar opened a new issue, #9477:
URL: https://github.com/apache/seatunnel/issues/9477
I am trying to run a simple streaming job with the Spark engine which reads from
Kafka and writes files to S3. What I see is that the job gets stuck and never
moves forward, and the files never get written. I pushed data into Kafka
after starting the job, and I can see that the current offset for the consumer group
advances to the latest offsets on the Kafka server, but the data is never written by
SeaTunnel. The logs also stay at the same point forever.
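For reference, the test records I pushed match the schema declared in the config below. A minimal sketch (field values are illustrative, not the actual data):

```python
import json

# Illustrative record matching the declared schema (name: string, age: int);
# this just shows the shape of one JSON payload sent to topic_streaming.
record = {"name": "alice", "age": 30}
payload = json.dumps(record)
print(payload)
```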
The SeaTunnel documentation for Spark streaming mode also seems to be very
sparse or missing. I didn't find a single example of how to run SeaTunnel with the
Spark engine for streaming, so even getting the Kafka read to work took a lot of
trial and error. Following is my config.
SeaTunnel Version
2.3.11
SeaTunnel Config
env {
  parallelism = 2
  job.mode = "STREAMING" // Change to BATCH for testing
  checkpoint.interval = 2000
  checkpoint.path = "file:///Users/b0279627/Downloads/kafka_streaming"
  job.name = "SeaTunnel-spark-streaming2"
  spark.executor.instances = 1
  spark.executor.cores = 1
  spark.executor.memory = "1g"
  spark.master = local
  spark.eventLog.enabled = "true"
  spark.eventLog.dir = "file:///Users/b0279627/Downloads/spark/"
}

source {
  Kafka {
    schema = {
      fields {
        name = "string"
        age = "int"
      }
    }
    topic = "topic_streaming"
    bootstrap.servers = "localhost:9092"
    consumer.group = "seatunnel_batch_new4" // Use a unique consumer group
    # Important: For Spark compatibility with "latest" mode
    start_mode = "group_offsets"
    # Spark Kafka specific configurations
    kafka.config {
      "auto.offset.reset" = "earliest"
      "enable.auto.commit" = "true"
      "max.poll.records" = "1000"
      "session.timeout.ms" = "60000"
      "heartbeat.interval.ms" = "15000"
      "fetch.max.wait.ms" = "5000"
      "auto.commit.interval.ms" = "50"
    }
  }
}

sink {
  S3File {
    parallelism = 18
    bucket = "s3a://instance1-bucket"
    path = "/datalake_staging/seatunnel_test/spark_2/kafka_streaming"
    fs.s3a.endpoint = "http://localhost:9500"
    fs.s3a.aws.credentials.provider = "org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider"
    file_format_type = "parquet"
    access_key = "sea_tunnnel"
    secret_key = "sea_tunnel"
    # Enhanced Hadoop S3 configuration for Spark
    hadoop_s3_properties {
      "fs.s3a.path.style.access" = "true"
      "fs.s3a.impl" = "org.apache.hadoop.fs.s3a.S3AFileSystem"
      "fs.s3a.connection.ssl.enabled" = "false"
      "fs.s3a.fast.upload" = "true"
      "fs.s3a.connection.maximum" = "100"
      "fs.s3a.attempts.maximum" = "20"
      "fs.s3a.connection.timeout" = "300000"
      "fs.s3a.multipart.size" = "5242880"
    }
    # Important for Spark to correctly handle the data
    save_mode = "append"
  }
}
Running Command
./bin/start-seatunnel-spark-3-connector-v2.sh \
  --master yarn \
  --deploy-mode client \
  --config ./config/
Is Spark streaming supported in SeaTunnel? If there are any example
configs, please point me to them.
Also, if I want to know which sink connectors are supported with Spark
structured streaming in SeaTunnel, how can I find that out? I didn't find this
information anywhere in the SeaTunnel documentation or code.
What is the behaviour of SeaTunnel if the job mode is set to streaming but
the source connector doesn't support streaming? Will it throw an error, or run the job
in batch mode? I tried with the SFTP file source, and it never ran in streaming
mode; by default it fell back to batch mode.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]