Re: Map Question

2015-04-23 Thread Vadim Bichutskiy
> Sent: Thursday, April 23, 2015 12:00 PM Eastern Standard Time
> To: Tathagata Das
> Cc: user@spark.apache.org
> Subject: Re: Map Question
>
> Here it is. How do I access a broadcastVar in a function that's in another
> module (process_stuff.py below):

RE: Map Question

2015-04-23 Thread Ganelin, Ilya
-----Original Message-----
From: Vadim Bichutskiy [vadim.bichuts...@gmail.com]
Sent: Thursday, April 23, 2015 12:00 PM Eastern Standard Time
To: Tathagata Das
Cc: user@spark.apache.org
Subject: Re: Map Question

Here it is. How do

Re: Map Question

2015-04-23 Thread Vadim Bichutskiy
Here it is. How do I access a broadcastVar in a function that's in another module (process_stuff.py below):

Thanks,
Vadim

main.py
---
from pyspark import SparkContext, SparkConf
from pyspark.streaming import StreamingContext
from pyspark.sql import SQLContext
from process_stuff import myfunc
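One common way to make a broadcast variable visible to a function that lives in another module is to pass the handle in as an argument rather than relying on a global. The sketch below is only an illustration under stated assumptions: the second parameter of `myfunc`, its body, and the `FakeBroadcast` stand-in are hypothetical and are not taken from the thread. The stand-in only mimics the `.value` attribute so the example runs without a Spark installation.

```python
# process_stuff.py (sketch): take the broadcast handle as a parameter instead
# of relying on a global that only exists in main.py.
def myfunc(record, broadcast_var):
    # The membership test is illustrative; the real myfunc body is not shown in the thread.
    return (record, record in broadcast_var.value)


class FakeBroadcast(object):
    """Hypothetical stand-in exposing .value, so the sketch runs without Spark."""

    def __init__(self, value):
        self.value = value


# main.py (sketch)
mylist = ["a", "b"]
broadcastVar = FakeBroadcast(mylist)  # with Spark: broadcastVar = sc.broadcast(mylist)

# The lambda closes over broadcastVar, so the handle travels with the function.
results = list(map(lambda x: myfunc(x, broadcastVar), ["a", "z"]))
# with Spark: myrdd.map(lambda x: myfunc(x, broadcastVar))
```

With a real SparkContext, the lambda would be passed to `myrdd.map(...)` in the same way, and only the small broadcast handle is serialized with the closure.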

Re: Map Question

2015-04-22 Thread Tathagata Das
Can you give the full code, especially myfunc?

On Wed, Apr 22, 2015 at 2:20 PM, Vadim Bichutskiy <vadim.bichuts...@gmail.com> wrote:
> Here's what I did:
>
> print 'BROADCASTING...'
> broadcastVar = sc.broadcast(mylist)
> print broadcastVar
> print broadcastVar.value
> print 'FINISHED BROADCASTI

Re: Map Question

2015-04-22 Thread Vadim Bichutskiy
Here's what I did:

print 'BROADCASTING...'
broadcastVar = sc.broadcast(mylist)
print broadcastVar
print broadcastVar.value
print 'FINISHED BROADCASTING...'

The above works fine, but when I call myrdd.map(myfunc) I get

*NameError: global name 'broadcastVar' is not defined*

The myfunc function is
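A likely explanation for this NameError is that a Python "global" is global to one module only: a function looks names up in the module where it was defined, not in the module that calls it. The sketch below reproduces that behavior in plain Python, without Spark. The body of `myfunc` here is an assumption (the thread's version is cut off), and the throwaway module only stands in for process_stuff.py.

```python
import types

# Build a throwaway module standing in for process_stuff.py.
process_stuff = types.ModuleType("process_stuff")
exec(
    "def myfunc(x):\n"
    "    return broadcastVar.value  # looked up in process_stuff's globals\n",
    process_stuff.__dict__,
)

broadcastVar = object()  # defined in the calling code only, as in main.py

try:
    process_stuff.myfunc(1)
    error = None
except NameError as exc:
    # The name exists here, but not in the module where myfunc was defined.
    error = str(exc)
```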

Re: Map Question

2015-04-22 Thread Tathagata Das
Absolutely. The same code would work in local as well as distributed mode!

On Wed, Apr 22, 2015 at 11:08 AM, Vadim Bichutskiy <vadim.bichuts...@gmail.com> wrote:
> Can I use broadcast vars in local mode?
>
> On Wed, Apr 22, 2015 at 2:06 PM, Tathagata Das wrote:
>
>> Yep. Not efficient. P

Re: Map Question

2015-04-22 Thread Vadim Bichutskiy
Can I use broadcast vars in local mode?

On Wed, Apr 22, 2015 at 2:06 PM, Tathagata Das wrote:
> Yep. Not efficient. Pretty bad actually. That's why broadcast variables
> were introduced right at the very beginning of Spark.
>
> On Wed, Apr 22, 2015 at 10:58 AM, Vadim Bichutskiy <
> vadim.bi

Re: Map Question

2015-04-22 Thread Tathagata Das
Yep. Not efficient. Pretty bad actually. That's why broadcast variables were introduced right at the very beginning of Spark.

On Wed, Apr 22, 2015 at 10:58 AM, Vadim Bichutskiy <vadim.bichuts...@gmail.com> wrote:
> Thanks TD. I was looking into broadcast variables.
>
> Right now I am running it

Re: Map Question

2015-04-22 Thread Vadim Bichutskiy
Thanks TD. I was looking into broadcast variables. Right now I am running it locally...and I plan to move it to "production" on EC2. The way I fixed it is by doing myrdd.map(lambda x: (x, mylist)).map(myfunc) but I don't think it's efficient? mylist is filled only once at the start and never cha

Re: Map Question

2015-04-22 Thread Tathagata Das
Is mylist present on every executor? If not, then you have to pass it on, and broadcasts are the best way to do that. But note that once broadcast, it will be immutable at the executors, and if you update the list at the driver, you will have to broadcast it again.

TD

On Wed, Apr 22, 2015
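The "broadcast it again" step can be sketched as a small driver-side holder that replaces the broadcast whenever the list changes. This is a minimal sketch under assumptions, not code from the thread: `RefreshableBroadcast` and `FakeBroadcast` are hypothetical names, and the stand-in only mimics the `.value` attribute and `unpersist()` method of a PySpark broadcast so the example runs without Spark.

```python
class RefreshableBroadcast(object):
    """Hypothetical driver-side helper that keeps the latest broadcast of a list."""

    def __init__(self, broadcast_fn, items):
        self._broadcast_fn = broadcast_fn  # with Spark: sc.broadcast
        self.current = broadcast_fn(list(items))

    def update(self, items):
        old = self.current
        # Executors never see driver-side edits, so create a new broadcast.
        self.current = self._broadcast_fn(list(items))
        old.unpersist()  # drop the stale cached copies
        return self.current


class FakeBroadcast(object):
    """Hypothetical stand-in so the sketch runs without Spark."""

    def __init__(self, value):
        self.value = value
        self.released = False

    def unpersist(self):
        self.released = True


holder = RefreshableBroadcast(FakeBroadcast, ["a"])
first = holder.current
second = holder.update(["a", "b"])  # list changed at the driver: broadcast again
```

Functions that already captured the old handle keep seeing the old value, so driver-side code that builds the map function would need to read `holder.current` again after each update.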

Map Question

2015-04-22 Thread Vadim Bichutskiy
I am using Spark Streaming with Python. For each RDD, I call a map, i.e., myrdd.map(myfunc), where myfunc is in a separate Python module. In yet another separate Python module I have a global list, i.e. mylist, that's populated with metadata. I can't get myfunc to see mylist...it's always empty. Alternat