Thanks, everyone. Evo, could you provide a link to the Lookup RDD project? I can't seem to locate it on GitHub. (Yes, to your point, our project is Spark Streaming based.) Thank you.
On Fri, Jun 5, 2015 at 6:04 AM, Evo Eftimov <evo.efti...@isecc.com> wrote:

> Oops, @Yiannis, sorry to be a party pooper, but the Job Server is for Spark
> Batch jobs (besides, anyone can put something like that together in 5
> minutes), while I am under the impression that Dmitry is working on a Spark
> Streaming app.
>
> Besides, the Job Server is essentially for sharing the Spark Context
> between multiple threads.
>
> Re Dmitry's initial question: you can load large data sets as a Batch
> (Static) RDD from any Spark Streaming app and then join DStream RDDs
> against it to emulate "lookups". You can also try the "Lookup RDD"; there
> is a GitHub project for it.
>
> From: Dmitry Goldenberg [mailto:dgoldenberg...@gmail.com]
> Sent: Friday, June 5, 2015 12:12 AM
> To: Yiannis Gkoufas
> Cc: Olivier Girardot; user@spark.apache.org
> Subject: Re: How to share large resources like dictionaries while
> processing data with Spark?
>
> Thanks so much, Yiannis, Olivier, Huang!
>
> On Thu, Jun 4, 2015 at 6:44 PM, Yiannis Gkoufas <johngou...@gmail.com>
> wrote:
>
> Hi there,
>
> I would recommend checking out
> https://github.com/spark-jobserver/spark-jobserver which I think gives
> the functionality you are looking for. I haven't tested it, though.
>
> BR
>
> On 5 June 2015 at 01:35, Olivier Girardot <ssab...@gmail.com> wrote:
>
> You can use it as a broadcast variable, but if it's "too" large (more than
> 1 GB, I guess), you may need to share it by joining it, via some kind of
> key, to the other RDDs. But this is the kind of thing broadcast variables
> were designed for.
>
> Regards,
>
> Olivier.
>
> On Thu, Jun 4, 2015 at 11:50 PM, dgoldenberg <dgoldenberg...@gmail.com>
> wrote:
>
> We have some pipelines defined where sometimes we need to load potentially
> large resources such as dictionaries.
>
> What would be the best strategy for sharing such resources among the
> transformations/actions within a consumer? Can they be shared somehow
> across the RDDs?
>
> I'm looking for a way to load such a resource once into the cluster memory
> and have it be available throughout the lifecycle of a consumer...
>
> Thanks.
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/How-to-share-large-resources-like-dictionaries-while-processing-data-with-Spark-tp23162.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
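For anyone finding this thread later, here is a minimal Scala sketch of what Evo describes: load the dictionary once as a static (batch) RDD, cache it, and join each micro-batch against it via DStream.transform(). The HDFS paths, the socket source, and the tab-separated dictionary format are hypothetical stand-ins, not anything from the original posts:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StaticLookupJoin {
  def main(args: Array[String]): Unit = {
    val sc  = new SparkContext(new SparkConf().setAppName("stream-lookup-join"))
    val ssc = new StreamingContext(sc, Seconds(10))

    // Static lookup RDD: loaded once and cached in cluster memory for the
    // lifetime of the streaming app. (Path and format are illustrative.)
    val lookup = sc.textFile("hdfs:///data/dictionary.tsv")
      .map { line => val Array(k, v) = line.split("\t", 2); (k, v) }
      .cache()

    // A keyed DStream; the socket source is just a stand-in for the real input.
    val words = ssc.socketTextStream("localhost", 9999).map(w => (w, 1))

    // transform() exposes each micro-batch as an RDD, so we can join it
    // against the static RDD to emulate a per-record dictionary lookup.
    val enriched = words.transform(batch => batch.join(lookup))

    enriched.print()
    ssc.start()
    ssc.awaitTermination()
  }
}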
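And a sketch of the broadcast-variable route Olivier mentions, which fits when the dictionary comfortably fits in executor memory. sc.broadcast() ships the map to each executor once, and tasks then read it locally through .value. Again, the path and format below are made up for illustration:

import org.apache.spark.{SparkConf, SparkContext}

object BroadcastDict {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("broadcast-dict"))

    // Build the dictionary once on the driver. (Illustrative path/format.)
    val dict: Map[String, String] = sc.textFile("hdfs:///data/dictionary.tsv")
      .map { line => val Array(k, v) = line.split("\t", 2); (k, v) }
      .collect()
      .toMap

    // One copy per executor, reused by every task, instead of being
    // re-serialized into each task's closure.
    val dictBc = sc.broadcast(dict)

    val enriched = sc.textFile("hdfs:///data/input")
      .map(w => (w, dictBc.value.getOrElse(w, "UNKNOWN")))

    enriched.take(10).foreach(println)
    sc.stop()
  }
}

The same broadcast handle can be referenced inside Spark Streaming closures as well, which matches the "load once, use for the consumer's lifetime" requirement from the original question.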