+1 for subclassing. It's more flexible if we can subclass the implementation classes.

On Apr 1, 2015 12:19 PM, "Cody Koeninger" <c...@koeninger.org> wrote:
> As I said in the original ticket, I think the implementation classes
> should be exposed so that people can subclass and override compute() to
> suit their needs.
>
> Just adding a function from Time => Set[TopicAndPartition] wouldn't be
> sufficient for some of my current production use cases.
>
> compute() isn't really a function from Time => Option[KafkaRDD]; it's a
> function from (Time, current offsets, Kafka metadata, etc.) =>
> Option[KafkaRDD].
>
> I think it's more straightforward to give access to that additional state
> via subclassing than it is to add more callbacks for every possible use
> case.
>
> On Wed, Apr 1, 2015 at 2:01 PM, Tathagata Das <t...@databricks.com> wrote:
>
>> We should be able to support that use case in the direct API. It may be
>> as simple as allowing users to pass in a function that returns the set
>> of topics+partitions to read from, that is, a function (Time) =>
>> Set[TopicAndPartition]. This gets called every batch interval before the
>> offsets are decided, which would allow users to add topics, delete
>> topics, and modify partitions on the fly.
>>
>> What do you think, Cody?
>>
>> On Wed, Apr 1, 2015 at 11:57 AM, Neelesh <neele...@gmail.com> wrote:
>>
>>> Thanks Cody!
>>>
>>> On Wed, Apr 1, 2015 at 11:21 AM, Cody Koeninger <c...@koeninger.org>
>>> wrote:
>>>
>>>> If you want to change topics from batch to batch, you can always just
>>>> create a KafkaRDD repeatedly.
>>>>
>>>> The streaming code as it stands assumes a consistent set of topics,
>>>> though. The implementation is private, so you can't subclass it without
>>>> building your own Spark.
>>>>
>>>> On Wed, Apr 1, 2015 at 1:09 PM, Neelesh <neele...@gmail.com> wrote:
>>>>
>>>>> Thanks Cody, that was really helpful. I have a much better
>>>>> understanding now. One last question: Kafka topics are initialized once
>>>>> in the driver; is there an easy way of adding/removing topics on the
>>>>> fly? KafkaRDD#getPartitions() seems to be computed only once, with no
>>>>> way of refreshing them.
>>>>>
>>>>> Thanks again!
>>>>>
>>>>> On Wed, Apr 1, 2015 at 10:01 AM, Cody Koeninger <c...@koeninger.org>
>>>>> wrote:
>>>>>
>>>>>> https://github.com/koeninger/kafka-exactly-once/blob/master/blogpost.md
>>>>>>
>>>>>> The Kafka consumers run in the executors.
>>>>>>
>>>>>> On Wed, Apr 1, 2015 at 11:18 AM, Neelesh <neele...@gmail.com> wrote:
>>>>>>
>>>>>>> With receivers, it was pretty obvious which code ran where: each
>>>>>>> receiver occupied a core and ran on the workers. However, with the
>>>>>>> new Kafka direct input streams, it's hard for me to understand where
>>>>>>> the code that reads from the Kafka brokers runs. Does it run on the
>>>>>>> driver (I hope not), or does it run on the workers?
>>>>>>>
>>>>>>> Any help appreciated.
>>>>>>> Thanks!
>>>>>>> -neelesh
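To make the shape of the proposal in this thread concrete, here is a toy, Spark-free Scala sketch of a batch loop that calls a user-supplied `Time => Set[TopicAndPartition]` function once per batch interval, before offsets would be decided. Note that `TopicAndPartition`, `Time`, and the loop itself are local stand-ins written for illustration, not the real Spark or Kafka classes, and this is not how the direct stream was actually implemented.

```scala
// Local stand-ins for the real Kafka/Spark classes (illustration only).
case class TopicAndPartition(topic: String, partition: Int)
case class Time(milliseconds: Long)

object DynamicTopicsSketch {
  // The user-supplied callback proposed in the thread: invoked once per
  // batch interval, before offsets are decided, so the set of topics and
  // partitions can change on the fly.
  type TopicSelector = Time => Set[TopicAndPartition]

  // Simulate a sequence of batch intervals, asking the selector at each one.
  def runBatches(selector: TopicSelector, times: Seq[Time]): Seq[Set[TopicAndPartition]] =
    times.map(selector)

  def main(args: Array[String]): Unit = {
    // Example selector: a second topic appears starting at t = 2000 ms.
    val selector: TopicSelector = t =>
      if (t.milliseconds < 2000) Set(TopicAndPartition("orders", 0))
      else Set(TopicAndPartition("orders", 0), TopicAndPartition("clicks", 0))

    runBatches(selector, Seq(Time(1000), Time(2000), Time(3000))).foreach(println)
  }
}
```

This only shows why a bare `Time => Set[TopicAndPartition]` callback is limited in the way Cody describes: the selector sees only the batch time, not the current offsets or broker metadata, which is the extra state that overriding compute() in a subclass would expose.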