Moved to user list. I'm not really clear on what you're trying to accomplish (why put the CSV file through Kafka instead of reading it directly with Spark?).
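If reading it directly is an option, something like the following would skip Kafka entirely. This is just a minimal sketch assuming Spark 2.x; the input path and Hive table name are placeholders:

    import org.apache.spark.sql.SparkSession

    // Read the CSV straight into a DataFrame and write it to Hive.
    // Path, table name, and schema inference are illustrative assumptions.
    val spark = SparkSession.builder()
      .appName("csv-to-hive")
      .enableHiveSupport()
      .getOrCreate()

    val df = spark.read
      .option("header", "false")
      .option("inferSchema", "true")
      .csv("/path/to/input.csv")

    df.write.mode("append").saveAsTable("my_hive_table")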
auto.offset.reset=largest just means that when the job starts without any defined offsets, it will begin at the highest (most recent) available offsets. That's probably not what you want if you've already loaded the CSV lines into Kafka (see the sketch below the quoted message).

On Wed, Nov 16, 2016 at 2:45 PM, Hoang Bao Thien <hbthien0...@gmail.com> wrote:
> Hi all,
>
> I would like to ask a question about the size of Kafka stream batches. I want
> to push data (e.g., a *.csv file) into Kafka, then use Spark Streaming to read
> it back from Kafka and save it to Hive with Spark SQL. The CSV file is about
> 100MB with ~250K messages/rows (each row has about 10 integer fields). I see
> that Spark Streaming first receives two partitions/batches, the first with 60K
> messages and the second with 50K. But from the third batch on, Spark receives
> only 200 messages per batch (or partition). I think this problem comes from
> Kafka or some configuration in Spark. I already tried the setting
> "auto.offset.reset=largest", but every batch still gets only 200 messages.
>
> Could you please tell me how to fix this problem?
> Thank you so much.
>
> Best regards,
> Alex
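A minimal sketch of the consumer setup, assuming the spark-streaming-kafka 0.8 direct API; the broker address, batch interval, and topic name ("csv-topic") are placeholders:

    import kafka.serializer.StringDecoder
    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka.KafkaUtils

    // Direct stream that starts from the earliest available offsets,
    // so CSV lines already sitting in the topic are picked up.
    val conf = new SparkConf().setAppName("csv-from-kafka")
    val ssc = new StreamingContext(conf, Seconds(10))

    val kafkaParams = Map[String, String](
      "metadata.broker.list" -> "localhost:9092",
      "auto.offset.reset" -> "smallest"  // old-consumer equivalent of "earliest"
    )

    val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, Set("csv-topic"))

Note that auto.offset.reset only matters when there are no offsets already stored for the job (e.g., no checkpoint), which is the "without any defined offsets" case mentioned above.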