Yes, when old broadcast objects are no longer referenced in the driver, the associated data in the driver AND the executors will get cleared.
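Besides that automatic, GC-driven cleanup, the data can also be released explicitly via the Broadcast API. A minimal Scala sketch (the app name and data are illustrative):

```scala
import org.apache.spark.{SparkConf, SparkContext}

object BroadcastCleanupSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setMaster("local[2]").setAppName("bc-cleanup"))

    val lookup = sc.broadcast(Map("a" -> 1, "b" -> 2))
    val total = sc.parallelize(Seq("a", "b", "a"))
      .map(k => lookup.value(k))
      .sum()

    // Proactively drop the broadcast data from the executors; the driver
    // still holds it and will lazily re-broadcast if the variable is reused.
    lookup.unpersist(blocking = true)

    // destroy() removes the data from driver and executors for good;
    // the variable cannot be used afterwards.
    lookup.destroy()

    sc.stop()
  }
}
```

Calling `unpersist()` is the lighter option when the same variable may be needed again; `destroy()` matches the "not referenced any more" case described above.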
On Mon, Oct 5, 2015 at 1:40 PM, Olivier Girardot <[email protected]> wrote:

> @td does that mean that the "old" broadcast data will in any way be
> "garbage collected" at some point if no RDD or transformation is using it
> anymore?
>
> Regards,
>
> Olivier.
>
> 2015-04-09 21:49 GMT+02:00 Amit Assudani <[email protected]>:
>
>> Thanks a lot TD for the detailed answers. The answers lead to a few more
>> questions:
>>
>> 1. "the transform RDD-to-RDD function runs on the driver" - I didn't
>> understand this; does it mean that when I use the transform function on a
>> DStream, it is not parallelized? Surely I am missing something here.
>> 2. updateStateByKey I think won't work in this use case: I have three
>> separate attribute streams (with different frequencies) that make up the
>> combined state (i.e. the entity) at a point in time, on which I want to do
>> some processing. Do you think otherwise?
>> 3. transform+join seems the only option so far, but can you estimate how
>> this would perform/react on a cluster? Assume master data in the 100s of
>> GBs, and a join based on some row key. We are talking about a slice of
>> stream data being joined with 100s of GBs of master data continuously. Is
>> it something that can be done but should not be done?
>>
>> Regards,
>> Amit
>>
>> From: Tathagata Das <[email protected]>
>> Date: Thursday, April 9, 2015 at 3:13 PM
>> To: amit assudani <[email protected]>
>> Cc: "[email protected]" <[email protected]>
>> Subject: Re: Lookup / Access of master data in spark streaming
>>
>> Responses inline. Hope they help.
>>
>> On Thu, Apr 9, 2015 at 8:20 AM, Amit Assudani <[email protected]>
>> wrote:
>>
>>> Hi Friends,
>>>
>>> I am trying to solve a use case in Spark Streaming; I need help on
>>> getting to the right approach for looking up / updating the master data.
>>>
>>> Use case (simplified):
>>> I have a dataset of entities with three attributes and an identifier/row
>>> key in a persistent store.
>>>
>>> Each attribute, along with the row key, comes from a different stream;
>>> effectively there are 3 source streams.
>>>
>>> Now, whenever any attribute comes in, I want to update/sync the
>>> persistent store and do some processing, but the processing requires the
>>> latest state of the entity with the latest values of all three attributes.
>>>
>>> I wish I had all the entities cached in some sort of centralized cache
>>> (like we have data in HDFS) within Spark Streaming which could be used
>>> for data-local processing. But I assume there is no such thing.
>>>
>>> Potential approaches I am thinking of (I suspect the first two are not
>>> feasible, but I want to confirm):
>>>
>>> 1. Are broadcast variables mutable? If yes, can I use one as a cache for
>>> all entities sizing around 100s of GBs, provided I have a cluster with
>>> enough RAM?
>>
>> Broadcast variables are not mutable. But you can always create a new
>> broadcast variable when you want and use the "latest" broadcast variable
>> in your computation:
>>
>> dstream.transform { rdd =>
>>   // fetch the existing broadcast variable, or update + create a new one
>>   val latestBroadcast = getLatestBroadcastVariable()
>>   // use latestBroadcast in the RDD transformations
>>   val transformedRDD = rdd. ......
>>   transformedRDD
>> }
>>
>> Since the transform RDD-to-RDD function runs on the driver every batch
>> interval, it will always use the latest broadcast variable that you want.
>> Note, though, that whenever you create a new broadcast variable, the next
>> batch may take a little longer, as the data needs to actually be broadcast
>> out. That can also be made asynchronous by running a simple task (to force
>> the broadcast out) on any new broadcast variable in a different thread
>> than the Spark Streaming batch schedule, but using the same underlying
>> SparkContext.
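`getLatestBroadcastVariable()` in the snippet above is not a Spark API; it stands for driver-side refresh logic the application provides. One possible sketch of that pattern (the helper names, `Map[String, String]` payload, and time-based refresh condition are all illustrative assumptions, not Spark conventions):

```scala
import java.util.concurrent.atomic.AtomicReference

import org.apache.spark.SparkContext
import org.apache.spark.broadcast.Broadcast
import org.apache.spark.streaming.dstream.DStream

object BroadcastRefresh {
  // Holds the current broadcast of the master data; swapped on the driver.
  private val current = new AtomicReference[Broadcast[Map[String, String]]]()
  @volatile private var lastRefresh = 0L
  private val refreshIntervalMs = 10 * 60 * 1000L  // e.g. refresh every 10 minutes

  // Illustrative loader; in practice this would read the master data store.
  private def loadMasterData(): Map[String, String] = Map.empty

  def getLatestBroadcast(sc: SparkContext): Broadcast[Map[String, String]] = {
    val now = System.currentTimeMillis()
    if (current.get == null || now - lastRefresh > refreshIntervalMs) {
      val old = current.get
      current.set(sc.broadcast(loadMasterData()))
      lastRefresh = now
      if (old != null) old.unpersist()  // free the stale copy on the executors
    }
    current.get
  }

  def enrich(stream: DStream[(String, String)]): DStream[(String, String, Option[String])] =
    stream.transform { rdd =>
      // transform runs on the driver each batch, so this refresh check is cheap
      val bc = getLatestBroadcast(rdd.sparkContext)
      rdd.map { case (k, v) => (k, v, bc.value.get(k)) }
    }
}
```

Only the refresh check runs on the driver; the `rdd.map` itself is still executed in parallel on the executors, which is the answer to Amit's question 1 above.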
>>> 2. Is there any kind of sticky partitioning possible, so that I can
>>> route my stream data through the same node where I have the corresponding
>>> entities (a subset of the entire store) cached in memory within the JVM /
>>> off-heap on that node? This would avoid lookups from the store.
>>
>> You could use updateStateByKey. That is quite sticky, but it does not
>> eliminate the possibility that the computation can run on a different
>> node. In fact this is necessary for fault tolerance - what if the node it
>> was supposed to run on goes down? The task will be run on a different
>> node, and you have to design your application such that it can handle
>> that.
>>
>>> 3. If I stream the entities from the persistent store into the engine,
>>> this becomes a 4th stream - the entity stream. How do I use join/merge to
>>> enable streams 1, 2 and 3 to look up and update the data from stream 4?
>>> Would DStream.join work for a few seconds' worth of data in the attribute
>>> streams against all the data in the entity stream? Or do I use transform
>>> and, within that, an RDD join? I have doubts whether I am leaning towards
>>> a core Spark approach within Spark Streaming.
>>
>> Depends on what kind of join! If you want to join every batch in a stream
>> with a static dataset (or a rarely updated dataset), then transform+join
>> is the way to go. If you want to join one stream with a window of data
>> from another stream, then DStream.join is the way to go.
>>
>>> 4. The last approach, which I think will surely work but which I want to
>>> avoid, is to keep the entities in an IMDB and do lookup/update calls from
>>> streams 1, 2 and 3.
>>>
>>> Any help is deeply appreciated, as this would help me design my system
>>> efficiently, and the solution approach may become a beacon for
>>> lookup-master-data sorts of problems.
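The transform+join option from question 3 can be sketched as follows. This is a hedged illustration, not a definitive implementation: the `Entity` type, stream names, and partition count are assumptions, and it assumes the large master RDD can be pre-partitioned by row key and cached so that each per-batch join only shuffles the small streaming side.

```scala
import org.apache.spark.HashPartitioner
import org.apache.spark.rdd.RDD
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.dstream.DStream

object MasterDataJoin {
  case class Entity(attr1: String, attr2: String, attr3: String)

  // Join each small batch of attribute updates against the large master RDD.
  def joinWithMaster(
      updates: DStream[(String, String)],      // (rowKey, attributeValue)
      master: RDD[(String, Entity)]            // 100s of GBs, keyed by rowKey
  ): DStream[(String, (String, Entity))] = {

    val partitioner = new HashPartitioner(200) // illustrative parallelism

    // Partition the master data once and cache it; MEMORY_AND_DISK lets
    // partitions that do not fit in RAM spill to disk instead of failing.
    val partitionedMaster = master
      .partitionBy(partitioner)
      .persist(StorageLevel.MEMORY_AND_DISK)

    updates.transform { batchRdd =>
      // Give the (small) streaming batch the same partitioner, so the join
      // co-locates with the cached master partitions and avoids reshuffling
      // the large side every batch interval.
      batchRdd.partitionBy(partitioner).join(partitionedMaster)
    }
  }
}
```

Whether this reacts well on a cluster depends mostly on keeping the master side cached and partitioned; if the master RDD were rebuilt or reshuffled each batch, the continuous join against 100s of GBs would likely not keep up with the batch interval.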
>>>
>>> Regards,
>>> Amit
>>>
>>> ------------------------------
>>>
>>> NOTE: This message may contain information that is confidential,
>>> proprietary, privileged or otherwise protected by law. The message is
>>> intended solely for the named addressee. If received in error, please
>>> destroy and notify the sender. Any use of this email is prohibited when
>>> received in error. Impetus does not represent, warrant and/or guarantee
>>> that the integrity of this communication has been maintained, nor that
>>> the communication is free of errors, virus, interception or interference.
>
> --
> *Olivier Girardot* | Associé
> [email protected]
> +33 6 24 09 17 94
