Hi all,

I would like to ask a question about Kafka/Spark Streaming batch sizes. I want
to publish data (e.g., a *.csv file) to Kafka, consume it with Spark Streaming,
and save the output to Hive using Spark SQL. The CSV file is about 100 MB and
holds ~250K messages/rows (each row has about 10 integer fields).
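
For reference, here is a minimal sketch of what my job does (the broker,
topic, and table names are placeholders, and I have cut each row down to 3
of its ~10 integer fields to keep it short):

  import kafka.serializer.StringDecoder
  import org.apache.spark.SparkConf
  import org.apache.spark.sql.hive.HiveContext
  import org.apache.spark.streaming.{Seconds, StreamingContext}
  import org.apache.spark.streaming.kafka.KafkaUtils

  // Simplified to 3 of the ~10 integer fields per row
  case class CsvRow(f1: Int, f2: Int, f3: Int)

  object CsvToHive {
    def main(args: Array[String]): Unit = {
      val conf = new SparkConf().setAppName("CsvToHive")
      val ssc = new StreamingContext(conf, Seconds(2))
      val hiveContext = new HiveContext(ssc.sparkContext)
      import hiveContext.implicits._

      // The CSV file itself is loaded into the topic beforehand, e.g. with:
      //   kafka-console-producer.sh --broker-list broker:9092 \
      //     --topic csv-topic < data.csv
      val kafkaParams = Map(
        "metadata.broker.list" -> "broker:9092", // placeholder broker list
        "auto.offset.reset" -> "largest")
      val stream = KafkaUtils.createDirectStream[
        String, String, StringDecoder, StringDecoder](
        ssc, kafkaParams, Set("csv-topic")) // placeholder topic

      stream.foreachRDD { rdd =>
        // Each message is a (key, value) pair; parse the value as CSV
        val df = rdd.map(_._2.split(","))
          .map(a => CsvRow(a(0).toInt, a(1).toInt, a(2).toInt))
          .toDF()
        df.write.mode("append").saveAsTable("csv_rows") // placeholder table
      }

      ssc.start()
      ssc.awaitTermination()
    }
  }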
I see that the first two batches Spark Streaming receives are large: the
first holds ~60K messages and the second ~50K. But from the third batch on,
Spark receives only 200 messages per batch (per partition).
I suspect the problem comes from Kafka or from some Spark configuration. I
already tried setting "auto.offset.reset=largest", but every batch still
gets only 200 messages.

Could you please tell me how to fix this problem?
Thank you so much.

Best regards,
Alex
