On 28 Mar 2014, at 01:38, Evgeny Shishkin <itparan...@gmail.com> wrote:

> 
> On 28 Mar 2014, at 01:32, Tathagata Das <tathagata.das1...@gmail.com> wrote:
> 
>> Yes, no one has reported this issue before. I just opened a JIRA on what I 
>> think is the main problem here:
>> https://spark-project.atlassian.net/browse/SPARK-1340
>> Some of the receivers don't get restarted. 
>> I have a bunch of refactoring in the NetworkReceiver ready to be posted as 
>> a PR that should fix this. 
>> 

Regarding this JIRA:
By default Spark commits offsets to ZooKeeper every few seconds (the 
high-level consumer's auto-commit interval).
Even if you fix reconnecting to Kafka, we do not know from which offsets it 
will begin to consume.
So it would not recompute the RDD as it should; it will receive arbitrary 
data, from the past or from the future.
With the high-level consumer we just do not have control over this.

The high-level consumer should not be used in production with Spark. Period.
Spark should use the low-level consumer, and control offsets and partition 
assignment deterministically, like Storm does.
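
For illustration, a minimal sketch of what that control looks like with the 
Kafka 0.8 low-level (SimpleConsumer) API. The broker host, topic "events", 
and partition 0 are placeholders, and leader lookup and error handling are 
omitted; the point is only that the consumer, not ZooKeeper, decides the 
start offset and records the end offset:

    import kafka.api.{FetchRequestBuilder, OffsetRequest, PartitionOffsetRequestInfo}
    import kafka.common.TopicAndPartition
    import kafka.consumer.SimpleConsumer

    val consumer = new SimpleConsumer("broker-host", 9092, 10000, 64 * 1024, "spark-input")

    // Ask the broker for the earliest available offset, instead of trusting
    // whatever was last auto-committed to ZooKeeper.
    val tap = TopicAndPartition("events", 0)
    val offsetRequest = OffsetRequest(
      Map(tap -> PartitionOffsetRequestInfo(OffsetRequest.EarliestTime, 1)))
    val startOffset =
      consumer.getOffsetsBefore(offsetRequest).partitionErrorAndOffsets(tap).offsets.head

    // Fetch a bounded chunk starting exactly at the offset we chose.
    val fetch = new FetchRequestBuilder()
      .clientId("spark-input")
      .addFetch("events", 0, startOffset, 1024 * 1024)
      .build()

    var nextOffset = startOffset
    for (msgAndOffset <- consumer.fetch(fetch).messageSet("events", 0)) {
      // ... hand msgAndOffset.message to the block generator ...
      nextOffset = msgAndOffset.nextOffset
    }
    // Persist nextOffset together with the generated block, so that after a
    // failure the exact same offset range can be re-fetched and the RDD
    // recomputed deterministically.
    consumer.close()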

>> Regarding the second problem, I have been thinking of adding flow control 
>> (i.e. limiting the rate of receiving) for a while, just haven't gotten 
>> around to it. 
>> I added another JIRA to track this issue:
>> https://spark-project.atlassian.net/browse/SPARK-1341
>> 
>> 

I think if we fix the Kafka input as above, we can control such a window 
automatically, like a TCP window with slow start and so on; see the sketch 
below.
But it would be great to have some fix available now anyway.
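
To make the TCP analogy concrete, a toy sketch of such a controller, in the 
AIMD style TCP uses (grow the per-batch limit while batches finish on time, 
cut it when they fall behind). Every name here is made up for the example; 
none of this is Spark API:

    // Hypothetical per-batch fetch limit with slow start and backoff.
    class BatchWindow(initial: Long, ceiling: Long) {
      private var limit = initial

      def maxMessages: Long = limit

      // Feed back each batch's processing time versus the batch interval.
      def feedback(processingMs: Long, batchIntervalMs: Long): Unit = {
        if (processingMs <= batchIntervalMs)
          limit = math.min(limit * 2, ceiling)  // keeping up: grow the window
        else
          limit = math.max(limit / 2, 1)        // falling behind: back off
      }
    }

A receiver would then fetch at most maxMessages per batch instead of 
draining Kafka as fast as it can.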


> 
> Thank you, I will participate and can help test the new code.
> Sorry for the caps lock; I literally spent the whole day debugging this. 
> 
> 
>> TD
>> 
>> 
>> On Thu, Mar 27, 2014 at 3:23 PM, Evgeny Shishkin <itparan...@gmail.com> 
>> wrote:
>> 
>> On 28 Mar 2014, at 01:11, Scott Clasen <scott.cla...@gmail.com> wrote:
>> 
>> > Evgeniy Shishkin wrote
>> >> So, the bottom line: the Kafka input stream just does not work.
>> >
>> >
>> > That was the conclusion I was coming to as well.  Are there open tickets
>> > around fixing this up?
>> >
>> 
>> I am not aware of any. Actually, nobody complained about Spark+Kafka 
>> before, so I thought it just worked; then we tried to build something on 
>> it and almost failed.
>> 
>> I think it is possible to steal/replicate how Twitter Storm works with 
>> Kafka.
>> It does manual partition assignment; at the least, that would help to 
>> balance the load (see the sketch below).
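>> 
>> For example (a sketch only; receiverIndex and numReceivers are 
>> hypothetical names, not anything in Spark or storm-kafka):
>> 
>>     // Deterministically spread Kafka partitions over N receivers,
>>     // similar in spirit to storm-kafka's static assignment.
>>     def partitionsFor(receiverIndex: Int,
>>                       numReceivers: Int,
>>                       numPartitions: Int): Seq[Int] =
>>       (0 until numPartitions).filter(_ % numReceivers == receiverIndex)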
>> 
>> There is another issue.
>> The StreamingContext creates new RDDs every batch interval, always, even 
>> if the previous computation did not finish.
>> 
>> But with Kafka, the data stays on the brokers, so we could consume the 
>> next batch later, after we finish the previous RDDs.
>> That would make it much, much simpler to avoid OOM when starting from the 
>> beginning of the log, because today we can pull a huge amount of data from 
>> Kafka during one batch interval and then run out of memory.
>> 
>> But we just cannot start slowly, and cannot limit how much to consume per 
>> batch.
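>> 
>> Even a fixed per-batch cap would help. A sketch (maxPerBatch is a made-up 
>> knob, not an existing Spark setting):
>> 
>>     // Stop pulling from the consumer's iterator once the batch quota is
>>     // reached; the remainder stays on the broker for the next batch.
>>     def takeBatch[T](messages: Iterator[T], maxPerBatch: Int): Seq[T] =
>>       messages.take(maxPerBatch).toVector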
>> 
>> 
>> >
>> 
>> 
> 
