RE: How to share large resources like dictionaries while processing data with Spark ?

Evo Eftimov Fri, 05 Jun 2015 07:31:47 -0700

Also, RDD.lookup() cannot be invoked from within transformations, e.g. map().


lookup() is an action, so it can be invoked only from the driver. If you want 
functionality like that from within transformations executed on the cluster 
nodes, try IndexedRDD.
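
To make the driver-only point concrete, here is a minimal PySpark sketch. The helper name lookup_on_driver and the variable names are made up for illustration; it assumes dict_rdd is a pair RDD of (key, value).

```python
def lookup_on_driver(dict_rdd, keys):
    """Resolve a handful of keys with RDD.lookup(), called from the driver.

    dict_rdd is assumed to be a pair RDD of (key, value); lookup() returns
    the list of values stored under a key.
    """
    return {key: dict_rdd.lookup(key) for key in keys}


# What does NOT work: referencing dict_rdd inside a transformation, e.g.
#   events_rdd.map(lambda e: dict_rdd.lookup(e))
# The closure would run on an executor, where RDD actions are not available.
```

This is only reasonable for a small number of keys, since every lookup() call is a separate job.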

 

Another option is to load a batch (static) RDD once in your Spark Streaming app 
and then keep joining, and then e.g. filtering, every incoming DStream RDD against 
the (big static) batch RDD.
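
A rough PySpark sketch of that join-then-filter pattern follows. The function names and the "active" flag on the dictionary entries are assumptions for illustration only; it assumes the DStream carries (key, event) pairs and static_rdd is a cached pair RDD of (key, dict_entry) loaded once at startup.

```python
def is_relevant(joined_record):
    """Filter predicate for records produced by the join.

    joined_record has the shape (key, (event, dict_entry)); the "active"
    field is a hypothetical flag, used only to show the filtering step.
    """
    _key, (_event, dict_entry) = joined_record
    return bool(dict_entry.get("active", False))


def join_with_static(dstream, static_rdd):
    """Join every micro-batch RDD of the DStream against the static RDD."""
    # transform() runs the given function on the driver once per batch,
    # so referencing static_rdd here is fine.
    joined = dstream.transform(lambda rdd: rdd.join(static_rdd))
    return joined.filter(is_relevant)
```

Since join() is an inner join, events whose key is missing from the static RDD are dropped; leftOuterJoin() would keep them.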

 

From: Evo Eftimov [mailto:[email protected]] 
Sent: Friday, June 5, 2015 3:27 PM
To: 'Dmitry Goldenberg'
Cc: 'Yiannis Gkoufas'; 'Olivier Girardot'; '[email protected]'
Subject: RE: How to share large resources like dictionaries while processing 
data with Spark ?

 

It is called Indexed RDD https://github.com/amplab/spark-indexedrdd 

 

From: Dmitry Goldenberg [mailto:[email protected]] 
Sent: Friday, June 5, 2015 3:15 PM
To: Evo Eftimov
Cc: Yiannis Gkoufas; Olivier Girardot; [email protected]
Subject: Re: How to share large resources like dictionaries while processing 
data with Spark ?

 

Thanks everyone. Evo, could you provide a link to the Lookup RDD project? I 
can't seem to locate it exactly on GitHub. (Yes, to your point, our project is 
Spark streaming based). Thank you.

 

On Fri, Jun 5, 2015 at 6:04 AM, Evo Eftimov <[email protected]> wrote:

Oops, @Yiannis, sorry to be a party pooper but the Job Server is for Spark 
batch jobs (besides, anyone can put something like that together in 5 min), while I am 
under the impression that Dmitry is working on a Spark Streaming app.

 

Besides, the Job Server is essentially for sharing the Spark Context between 
multiple threads.

 

Re Dmitry's initial question: you can load large data sets as a batch (static) RDD 
from any Spark Streaming app and then join DStream RDDs against them to 
emulate “lookups”. You can also try the “Lookup RDD”; there is a GitHub 
project.

 

From: Dmitry Goldenberg [mailto:[email protected]] 
Sent: Friday, June 5, 2015 12:12 AM
To: Yiannis Gkoufas
Cc: Olivier Girardot; [email protected]
Subject: Re: How to share large resources like dictionaries while processing 
data with Spark ?

 

Thanks so much, Yiannis, Olivier, Huang!

 

On Thu, Jun 4, 2015 at 6:44 PM, Yiannis Gkoufas <[email protected]> wrote:

Hi there,

 

I would recommend checking out 
https://github.com/spark-jobserver/spark-jobserver which I think gives the 
functionality you are looking for.

I haven't tested it though.

 

BR

 

On 5 June 2015 at 01:35, Olivier Girardot <[email protected]> wrote:

You can use it as a broadcast variable, but if it's "too" large (more than 1 GB, 
I guess), you may need to share it by joining it to the other RDDs on some kind 
of key.

But this is the kind of thing broadcast variables were designed for.
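
For what it's worth, a minimal PySpark sketch of the broadcast approach is below. The function names are invented for illustration, and it assumes the dictionary fits comfortably in memory as a plain Python dict.

```python
def tag_partition(words, dictionary):
    """Yield (word, definition-or-None) for each word, using a plain dict."""
    for word in words:
        yield word, dictionary.get(word)


def tag_with_broadcast(sc, words_rdd, dictionary):
    """Broadcast the dictionary once and use it inside a transformation."""
    bc = sc.broadcast(dictionary)  # shipped to each executor, not per task
    # bc.value is read on the executors; the dict itself is not re-serialized
    # into every task closure.
    return words_rdd.mapPartitions(lambda part: tag_partition(part, bc.value))
```

Here sc is the SparkContext and words_rdd an RDD of strings; the same bc handle can be reused by later transformations.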

 

Regards, 

 

Olivier.

 

On Thu, Jun 4, 2015 at 23:50, dgoldenberg <[email protected]> wrote:

We have some pipelines defined where sometimes we need to load potentially
large resources such as dictionaries.

What would be the best strategy for sharing such resources among the
transformations/actions within a consumer?  Can they be shared somehow
across the RDDs?

I'm looking for a way to load such a resource once into the cluster memory
and have it be available throughout the lifecycle of a consumer...

Thanks.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/How-to-share-large-resources-like-dictionaries-while-processing-data-with-Spark-tp23162.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
