Thanks so much, Yiannis, Olivier, Huang!

On Thu, Jun 4, 2015 at 6:44 PM, Yiannis Gkoufas <johngou...@gmail.com> wrote:
> Hi there,
>
> I would recommend checking out
> https://github.com/spark-jobserver/spark-jobserver which I think gives
> the functionality you are looking for.
> I haven't tested it myself, though.
>
> BR
>
> On 5 June 2015 at 01:35, Olivier Girardot <ssab...@gmail.com> wrote:
>
>> You can use it as a broadcast variable, but if it's "too" large (more
>> than 1 GB, I'd guess), you may need to share it instead by joining it
>> to the other RDDs on some kind of key.
>> But this is exactly the kind of thing broadcast variables were
>> designed for.
>>
>> Regards,
>>
>> Olivier.
>>
>> On Thu, Jun 4, 2015 at 11:50 PM, dgoldenberg <dgoldenberg...@gmail.com>
>> wrote:
>>
>>> We have some pipelines defined where we sometimes need to load
>>> potentially large resources such as dictionaries.
>>>
>>> What would be the best strategy for sharing such resources among the
>>> transformations/actions within a consumer? Can they be shared somehow
>>> across the RDDs?
>>>
>>> I'm looking for a way to load such a resource once into the cluster
>>> memory and have it be available throughout the lifecycle of a
>>> consumer...
>>>
>>> Thanks.
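
For what it's worth, here is a minimal sketch of the broadcast-variable approach Olivier describes. The file paths, the tab-separated dictionary format, and the token-lookup logic are just hypothetical placeholders for illustration:

import org.apache.spark.{SparkConf, SparkContext}
import scala.io.Source

object BroadcastDictExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("broadcast-dict"))

    // Load the dictionary once on the driver.
    // Assumes hypothetical key<TAB>value lines.
    val dict: Map[String, String] =
      Source.fromFile("/path/to/dictionary.tsv")
        .getLines()
        .map { line =>
          val Array(k, v) = line.split("\t", 2)
          k -> v
        }
        .toMap

    // Ship the dictionary to the executors once as a broadcast variable.
    val bcDict = sc.broadcast(dict)

    // Any transformation can now read bcDict.value locally on the executor.
    val input = sc.textFile("hdfs:///path/to/input")
    val resolved = input.map { token =>
      bcDict.value.getOrElse(token, token)
    }
    resolved.take(10).foreach(println)

    sc.stop()
  }
}

Each executor deserializes the broadcast value once and every task on that executor reads bcDict.value locally, so the dictionary crosses the network once per node rather than once per task, which is what makes this workable for large lookup tables.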