Hi Cody,

I think I misused the term 'data locality'. What I'm after is better
described as "node affinity": for as long as an executor is available, I
would like the same Kafka partition to be processed by the same node, in
order to take advantage of local in-memory structures.

In the receiver-based mode this was a given. Any ideas how to achieve that
with the direct stream approach?
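
One direction I'm considering, purely as an untested sketch: wrap the
KafkaRDD in a custom RDD that pins each partition to a host I choose
(PinnedLocationRDD and partitionToHost below are hypothetical names of
mine, not anything in the Spark API):

  import scala.reflect.ClassTag

  import org.apache.spark.{Partition, TaskContext}
  import org.apache.spark.rdd.RDD

  // Delegates all work to the parent RDD, but reports a fixed preferred
  // host per partition index. Note this is still only a hint to the
  // scheduler, not a guarantee.
  class PinnedLocationRDD[T: ClassTag](
      parent: RDD[T],
      partitionToHost: Map[Int, String]) extends RDD[T](parent) {

    override def getPartitions: Array[Partition] = parent.partitions

    override def compute(split: Partition, context: TaskContext): Iterator[T] =
      parent.iterator(split, context)

    override protected def getPreferredLocations(split: Partition): Seq[String] =
      partitionToHost.get(split.index).toSeq
  }

It could be applied with something like directStream.transform(rdd =>
new PinnedLocationRDD(rdd, hostMap)). As you point out below, though,
it's a preference, not a guarantee: if the executor on that host is gone,
the task will run elsewhere.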

-greetz, Gerard.


On Wed, Oct 14, 2015 at 4:31 PM, Cody Koeninger <c...@koeninger.org> wrote:

> Assumptions about locality in Spark are not very reliable, regardless of
> what consumer you use.  Even if you have locality preferences and the
> locality wait turned up really high, you still have to account for losing
> executors.
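>
> For reference, a minimal sketch of what "locality wait turned up really
> high" means in config (the keys are real Spark settings; the values are
> just illustrative):
>
>   import org.apache.spark.SparkConf
>
>   val conf = new SparkConf()
>     .set("spark.locality.wait", "30s")       // default is 3s
>     .set("spark.locality.wait.node", "30s")
>     .set("spark.locality.wait.rack", "30s")
>
> Even then, once an executor is lost, the task runs wherever there is
> capacity.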
>
> On Wed, Oct 14, 2015 at 8:23 AM, Gerard Maas <gerard.m...@gmail.com>
> wrote:
>
>> Thanks Saisai, Mishra,
>>
>> Indeed, that hint will only work in the case where the Spark executor is
>> co-located with the Kafka broker.
>> I think the answer to my question as stated is that there's no guarantee
>> of where the task will execute, as that depends on the scheduler and the
>> cluster resources available (Mesos in our case).
>> Therefore, any assumptions made about data locality with the
>> receiver-based approach need to be reconsidered when migrating to the
>> direct stream.
>>
>> ((In our case, we were using local caches to decide when a given
>> secondary index for a record should be produced and written.))
>>
>> -kr, Gerard.
>>
>> On Wed, Oct 14, 2015 at 2:58 PM, Saisai Shao <sai.sai.s...@gmail.com>
>> wrote:
>>
>>> This preferred locality is a hint to Spark to schedule Kafka tasks on
>>> the preferred nodes. If Kafka and Spark are two separate clusters, this
>>> locality hint obviously has no effect, and Spark will schedule the tasks
>>> following the node-local -> rack-local -> any pattern, like any other
>>> Spark tasks.
>>>
>>> On Wed, Oct 14, 2015 at 8:10 PM, Rishitesh Mishra <rmis...@snappydata.io
>>> > wrote:
>>>
>>>> Hi Gerard,
>>>> I am also trying to understand the same issue. From the code I have
>>>> seen, it looks like once the Kafka RDD is constructed, the execution of
>>>> that RDD is up to the task scheduler, which can schedule the partitions
>>>> based on the load on the nodes. There is a preferred node specified in
>>>> the Kafka RDD, but AFAIK it maps to the Kafka partition's host. So if
>>>> Kafka and Spark are co-hosted this will probably work. If not, I am not
>>>> sure how to get data locality for a partition.
>>>> Others, correct me if there is a way.
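>>>>
>>>> For reference, the preferred-location logic in KafkaRDD is essentially
>>>> the following (paraphrased from the Spark 1.x source, so treat it as a
>>>> sketch): it returns the host of the Kafka leader for that partition.
>>>>
>>>>   override def getPreferredLocations(thePart: Partition): Seq[String] = {
>>>>     val part = thePart.asInstanceOf[KafkaRDDPartition]
>>>>     Seq(part.host)  // leader host recorded when the batch was planned
>>>>   }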
>>>>
>>>> On Wed, Oct 14, 2015 at 3:08 PM, Gerard Maas <gerard.m...@gmail.com>
>>>> wrote:
>>>>
>>>>> In the receiver-based Kafka streaming model, given that each receiver
>>>>> starts as a long-running task, one can rely on a certain degree of data
>>>>> locality based on the Kafka partitioning: data published on a given
>>>>> topic/partition will land on the same Spark Streaming receiving node
>>>>> until the receiver dies and needs to be restarted somewhere else.
>>>>>
>>>>> As I understand it, the direct-Kafka streaming model just computes
>>>>> offsets and relays the work to a KafkaRDD. How does the execution
>>>>> locality compare to the receiver-based approach?
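>>>>>
>>>>> For concreteness, this is roughly how the direct stream gets created
>>>>> (Kafka 0.8 integration; ssc is our StreamingContext, and the broker
>>>>> and topic names are placeholders):
>>>>>
>>>>>   import kafka.serializer.StringDecoder
>>>>>   import org.apache.spark.streaming.kafka.KafkaUtils
>>>>>
>>>>>   val kafkaParams = Map("metadata.broker.list" -> "broker1:9092")
>>>>>   val topics = Set("events")
>>>>>   val stream = KafkaUtils.createDirectStream[
>>>>>     String, String, StringDecoder, StringDecoder](ssc, kafkaParams, topics)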
>>>>>
>>>>> thanks, Gerard.
>>>>>
>>>>
>>>>
>>>>
>>>> --
>>>>
>>>> Regards,
>>>> Rishitesh Mishra,
>>>> SnappyData (http://www.snappydata.io/)
>>>>
>>>> https://in.linkedin.com/in/rishiteshmishra
>>>>
>>>
>>>
>>
>
