+1 for subclassing. It's more flexible if we can subclass the implementation classes.

On Apr 1, 2015 12:19 PM, "Cody Koeninger" <c...@koeninger.org> wrote:
> As I said in the original ticket, I think the implementation classes
> should be exposed so that people can subclass and override compute() to
> suit their needs.
>
> Just adding a function from Time => Set[TopicAndPartition] wouldn't be
> sufficient for some of my current production use cases.
>
> compute() isn't really a function from Time => Option[KafkaRDD]; it's a
> function from (Time, current offsets, Kafka metadata, etc.) =>
> Option[KafkaRDD].
>
> I think it's more straightforward to give access to that additional state
> via subclassing than it is to add more callbacks for every possible use
> case.
>
> On Wed, Apr 1, 2015 at 2:01 PM, Tathagata Das <t...@databricks.com> wrote:
>
>> We should be able to support that use case in the direct API. It may be
>> as simple as allowing users to pass in a function that returns the set
>> of topics+partitions to read from, that is, a function (Time) =>
>> Set[TopicAndPartition]. This gets called every batch interval before the
>> offsets are decided, which would allow users to add topics, delete
>> topics, and modify partitions on the fly.
>>
>> What do you think, Cody?
>>
>> On Wed, Apr 1, 2015 at 11:57 AM, Neelesh <neele...@gmail.com> wrote:
>>
>>> Thanks Cody!
>>>
>>> On Wed, Apr 1, 2015 at 11:21 AM, Cody Koeninger <c...@koeninger.org>
>>> wrote:
>>>
>>>> If you want to change topics from batch to batch, you can always just
>>>> create a KafkaRDD repeatedly.
>>>>
>>>> The streaming code as it stands assumes a consistent set of topics,
>>>> though. The implementation is private, so you can't subclass it without
>>>> building your own Spark.
>>>>
>>>> On Wed, Apr 1, 2015 at 1:09 PM, Neelesh <neele...@gmail.com> wrote:
>>>>
>>>>> Thanks Cody, that was really helpful. I have a much better
>>>>> understanding now. One last question: Kafka topics are initialized once
>>>>> in the driver; is there an easy way of adding/removing topics on the
>>>>> fly? KafkaRDD#getPartitions() seems to be computed only once, with no
>>>>> way of refreshing them.
>>>>>
>>>>> Thanks again!
>>>>>
>>>>> On Wed, Apr 1, 2015 at 10:01 AM, Cody Koeninger <c...@koeninger.org>
>>>>> wrote:
>>>>>
>>>>>> https://github.com/koeninger/kafka-exactly-once/blob/master/blogpost.md
>>>>>>
>>>>>> The Kafka consumers run in the executors.
>>>>>>
>>>>>> On Wed, Apr 1, 2015 at 11:18 AM, Neelesh <neele...@gmail.com> wrote:
>>>>>>
>>>>>>> With receivers, it was pretty obvious which code ran where: each
>>>>>>> receiver occupied a core and ran on the workers. However, with the
>>>>>>> new Kafka direct input streams, it's hard for me to understand where
>>>>>>> the code that reads from the Kafka brokers runs. Does it run on the
>>>>>>> driver (I hope not), or does it run on the workers?
>>>>>>>
>>>>>>> Any help appreciated.
>>>>>>> Thanks!
>>>>>>> -neelesh
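To make the shape of the proposal in this thread concrete, here is a toy, Spark-free Scala sketch of a batch loop that calls a user-supplied `Time => Set[TopicAndPartition]` function once per batch interval, before offsets would be decided. Note that `TopicAndPartition`, `Time`, and the loop itself are local stand-ins written for illustration, not the real Spark or Kafka classes, and this is not how the direct stream was actually implemented.

```scala
// Local stand-ins for the real Kafka/Spark classes (illustration only).
case class TopicAndPartition(topic: String, partition: Int)
case class Time(milliseconds: Long)

object DynamicTopicsSketch {
  // The user-supplied callback proposed in the thread: invoked once per
  // batch interval, before offsets are decided, so the set of topics and
  // partitions can change on the fly.
  type TopicSelector = Time => Set[TopicAndPartition]

  // Simulate a sequence of batch intervals, asking the selector at each one.
  def runBatches(selector: TopicSelector, times: Seq[Time]): Seq[Set[TopicAndPartition]] =
    times.map(selector)

  def main(args: Array[String]): Unit = {
    // Example selector: a second topic appears starting at t = 2000 ms.
    val selector: TopicSelector = t =>
      if (t.milliseconds < 2000) Set(TopicAndPartition("orders", 0))
      else Set(TopicAndPartition("orders", 0), TopicAndPartition("clicks", 0))

    runBatches(selector, Seq(Time(1000), Time(2000), Time(3000))).foreach(println)
  }
}
```

This only shows why a bare `Time => Set[TopicAndPartition]` callback is limited in the way Cody describes: the selector sees only the batch time, not the current offsets or broker metadata, which is the extra state that overriding compute() in a subclass would expose.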