Yes, when old broadcast objects are no longer referenced in the driver, the associated data in the driver AND the executors will get cleared.
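Besides that automatic, GC-driven cleanup, the data can also be released explicitly via the Broadcast API. A minimal Scala sketch (the app name and data are illustrative):

```scala
import org.apache.spark.{SparkConf, SparkContext}

object BroadcastCleanupSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setMaster("local[2]").setAppName("bc-cleanup"))

    val lookup = sc.broadcast(Map("a" -> 1, "b" -> 2))
    val total = sc.parallelize(Seq("a", "b", "a"))
      .map(k => lookup.value(k))
      .sum()

    // Proactively drop the broadcast data from the executors; the driver
    // still holds it and will lazily re-broadcast if the variable is reused.
    lookup.unpersist(blocking = true)

    // destroy() removes the data from driver and executors for good;
    // the variable cannot be used afterwards.
    lookup.destroy()

    sc.stop()
  }
}
```

Calling `unpersist()` is the lighter option when the same variable may be needed again; `destroy()` matches the "not referenced any more" case described above.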
On Mon, Oct 5, 2015 at 1:40 PM, Olivier Girardot <[email protected]> wrote:

> @td does that mean that the "old" broadcast data will in any way be
> "garbage collected" at some point if no RDD or transformation is using it
> anymore?
>
> Regards,
>
> Olivier.
>
> 2015-04-09 21:49 GMT+02:00 Amit Assudani <[email protected]>:
>
>> Thanks a lot TD for the detailed answers. The answers lead to a few more
>> questions:
>>
>> 1. "the transform RDD-to-RDD function runs on the driver" - I didn't
>> understand this; does it mean that when I use the transform function on a
>> DStream, it is not parallelized? Surely I am missing something here.
>> 2. updateStateByKey I think won't work in this use case: I have three
>> separate attribute streams (with different frequencies) that make up the
>> combined state (i.e. the entity) at a point in time, on which I want to do
>> some processing. Do you think otherwise?
>> 3. transform+join seems the only option so far, but can you estimate how
>> this would perform/react on a cluster? Assume master data in the 100s of
>> GBs, and a join based on some row key. We are talking about a slice of
>> stream data being joined with 100s of GBs of master data continuously. Is
>> it something that can be done but should not be done?
>>
>> Regards,
>> Amit
>>
>> From: Tathagata Das <[email protected]>
>> Date: Thursday, April 9, 2015 at 3:13 PM
>> To: amit assudani <[email protected]>
>> Cc: "[email protected]" <[email protected]>
>> Subject: Re: Lookup / Access of master data in spark streaming
>>
>> Responses inline. Hope they help.
>>
>> On Thu, Apr 9, 2015 at 8:20 AM, Amit Assudani <[email protected]>
>> wrote:
>>
>>> Hi Friends,
>>>
>>> I am trying to solve a use case in Spark Streaming; I need help on
>>> getting to the right approach for looking up / updating the master data.
>>>
>>> Use case (simplified):
>>> I have a dataset of entities with three attributes and an identifier/row
>>> key in a persistent store.
>>>
>>> Each attribute, along with the row key, comes from a different stream;
>>> effectively there are 3 source streams.
>>>
>>> Now, whenever any attribute comes in, I want to update/sync the
>>> persistent store and do some processing, but the processing requires the
>>> latest state of the entity with the latest values of all three attributes.
>>>
>>> I wish I had all the entities cached in some sort of centralized cache
>>> (like we have data in HDFS) within Spark Streaming which could be used
>>> for data-local processing. But I assume there is no such thing.
>>>
>>> Potential approaches I am thinking of (I suspect the first two are not
>>> feasible, but I want to confirm):
>>>
>>> 1. Are broadcast variables mutable? If yes, can I use one as a cache for
>>> all entities sizing around 100s of GBs, provided I have a cluster with
>>> enough RAM?
>>
>> Broadcast variables are not mutable. But you can always create a new
>> broadcast variable when you want and use the "latest" broadcast variable
>> in your computation:
>>
>> dstream.transform { rdd =>
>>   // fetch the existing broadcast variable, or update + create a new one
>>   val latestBroadcast = getLatestBroadcastVariable()
>>   // use latestBroadcast in the RDD transformations
>>   val transformedRDD = rdd. ......
>>   transformedRDD
>> }
>>
>> Since the transform RDD-to-RDD function runs on the driver every batch
>> interval, it will always use the latest broadcast variable that you want.
>> Note, though, that whenever you create a new broadcast variable, the next
>> batch may take a little longer, as the data needs to actually be broadcast
>> out. That can also be made asynchronous by running a simple task (to force
>> the broadcast out) on any new broadcast variable in a different thread
>> than the Spark Streaming batch schedule, but using the same underlying
>> SparkContext.
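`getLatestBroadcastVariable()` in the snippet above is not a Spark API; it stands for driver-side refresh logic the application provides. One possible sketch of that pattern (the helper names, `Map[String, String]` payload, and time-based refresh condition are all illustrative assumptions, not Spark conventions):

```scala
import java.util.concurrent.atomic.AtomicReference

import org.apache.spark.SparkContext
import org.apache.spark.broadcast.Broadcast
import org.apache.spark.streaming.dstream.DStream

object BroadcastRefresh {
  // Holds the current broadcast of the master data; swapped on the driver.
  private val current = new AtomicReference[Broadcast[Map[String, String]]]()
  @volatile private var lastRefresh = 0L
  private val refreshIntervalMs = 10 * 60 * 1000L  // e.g. refresh every 10 minutes

  // Illustrative loader; in practice this would read the master data store.
  private def loadMasterData(): Map[String, String] = Map.empty

  def getLatestBroadcast(sc: SparkContext): Broadcast[Map[String, String]] = {
    val now = System.currentTimeMillis()
    if (current.get == null || now - lastRefresh > refreshIntervalMs) {
      val old = current.get
      current.set(sc.broadcast(loadMasterData()))
      lastRefresh = now
      if (old != null) old.unpersist()  // free the stale copy on the executors
    }
    current.get
  }

  def enrich(stream: DStream[(String, String)]): DStream[(String, String, Option[String])] =
    stream.transform { rdd =>
      // transform runs on the driver each batch, so this refresh check is cheap
      val bc = getLatestBroadcast(rdd.sparkContext)
      rdd.map { case (k, v) => (k, v, bc.value.get(k)) }
    }
}
```

Only the refresh check runs on the driver; the `rdd.map` itself is still executed in parallel on the executors, which is the answer to Amit's question 1 above.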
>>> 2. Is there any kind of sticky partitioning possible, so that I can
>>> route my stream data through the same node where I have the corresponding
>>> entities (a subset of the entire store) cached in memory within the JVM /
>>> off-heap on that node? This would avoid lookups from the store.
>>
>> You could use updateStateByKey. That is quite sticky, but it does not
>> eliminate the possibility that the computation can run on a different
>> node. In fact this is necessary for fault tolerance - what if the node it
>> was supposed to run on goes down? The task will be run on a different
>> node, and you have to design your application such that it can handle
>> that.
>>
>>> 3. If I stream the entities from the persistent store into the engine,
>>> this becomes a 4th stream - the entity stream. How do I use join/merge to
>>> enable streams 1, 2 and 3 to look up and update the data from stream 4?
>>> Would DStream.join work for a few seconds' worth of data in the attribute
>>> streams against all the data in the entity stream? Or do I use transform
>>> and, within that, an RDD join? I have doubts whether I am leaning towards
>>> a core Spark approach within Spark Streaming.
>>
>> Depends on what kind of join! If you want to join every batch in a stream
>> with a static dataset (or a rarely updated dataset), then transform+join
>> is the way to go. If you want to join one stream with a window of data
>> from another stream, then DStream.join is the way to go.
>>
>>> 4. The last approach, which I think will surely work but which I want to
>>> avoid, is to keep the entities in an IMDB and do lookup/update calls from
>>> streams 1, 2 and 3.
>>>
>>> Any help is deeply appreciated, as this would help me design my system
>>> efficiently, and the solution approach may become a beacon for
>>> lookup-master-data sorts of problems.
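The transform+join option from question 3 can be sketched as follows. This is a hedged illustration, not a definitive implementation: the `Entity` type, stream names, and partition count are assumptions, and it assumes the large master RDD can be pre-partitioned by row key and cached so that each per-batch join only shuffles the small streaming side.

```scala
import org.apache.spark.HashPartitioner
import org.apache.spark.rdd.RDD
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.dstream.DStream

object MasterDataJoin {
  case class Entity(attr1: String, attr2: String, attr3: String)

  // Join each small batch of attribute updates against the large master RDD.
  def joinWithMaster(
      updates: DStream[(String, String)],      // (rowKey, attributeValue)
      master: RDD[(String, Entity)]            // 100s of GBs, keyed by rowKey
  ): DStream[(String, (String, Entity))] = {

    val partitioner = new HashPartitioner(200) // illustrative parallelism

    // Partition the master data once and cache it; MEMORY_AND_DISK lets
    // partitions that do not fit in RAM spill to disk instead of failing.
    val partitionedMaster = master
      .partitionBy(partitioner)
      .persist(StorageLevel.MEMORY_AND_DISK)

    updates.transform { batchRdd =>
      // Give the (small) streaming batch the same partitioner, so the join
      // co-locates with the cached master partitions and avoids reshuffling
      // the large side every batch interval.
      batchRdd.partitionBy(partitioner).join(partitionedMaster)
    }
  }
}
```

Whether this reacts well on a cluster depends mostly on keeping the master side cached and partitioned; if the master RDD were rebuilt or reshuffled each batch, the continuous join against 100s of GBs would likely not keep up with the batch interval.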
>>>
>>> Regards,
>>> Amit
>>>
>>> ------------------------------
>>>
>>> NOTE: This message may contain information that is confidential,
>>> proprietary, privileged or otherwise protected by law. The message is
>>> intended solely for the named addressee. If received in error, please
>>> destroy and notify the sender. Any use of this email is prohibited when
>>> received in error. Impetus does not represent, warrant and/or guarantee
>>> that the integrity of this communication has been maintained, nor that
>>> the communication is free of errors, virus, interception or interference.
>
> --
> *Olivier Girardot* | Associé
> [email protected]
> +33 6 24 09 17 94
