-----Original Message-----
From: Vadim Bichutskiy [vadim.bichuts...@gmail.com]
Sent: Thursday, April 23, 2015 12:00 PM Eastern Standard Time
To: Tathagata Das
Cc: user@spark.apache.org
Subject: Re: Map Question
Here it is. How do I access a broadcastVar in a function that's in another
module (process_stuff.py below):
Thanks,
Vadim
main.py
---
from pyspark import SparkContext, SparkConf
from pyspark.streaming import StreamingContext
from pyspark.sql import SQLContext
from process_stuff import myfunc
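One way to let a function in another module see the broadcast value is to pass it in explicitly rather than relying on a module-level global. A minimal runnable sketch of that wiring (the mylist contents, myfunc's body, and the FakeBroadcast stand-in for sc.broadcast are all illustrative assumptions, not the actual code from this thread):

```python
# process_stuff.py side (hypothetical): myfunc takes the shared list as an
# argument instead of referencing a global that only exists in main.py.
def myfunc(record, shared_list):
    return (record, len(shared_list))

# main.py side: FakeBroadcast stands in for sc.broadcast(...) so this sketch
# runs without Spark. With Spark you would capture broadcastVar in the
# function passed to myrdd.map and call myfunc(x, broadcastVar.value).
class FakeBroadcast:
    def __init__(self, value):
        self.value = value

mylist = ["a", "b", "c"]             # illustrative metadata
broadcastVar = FakeBroadcast(mylist)

# plays the role of myrdd.map(lambda x: myfunc(x, broadcastVar.value))
mapped = [myfunc(x, broadcastVar.value) for x in [1, 2]]
print(mapped)  # [(1, 3), (2, 3)]
```

The point of the lambda-capture pattern is that the broadcast handle is closed over in the driver module, so the imported function never has to look up a global it does not own.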
Can you give full code? especially the myfunc?
On Wed, Apr 22, 2015 at 2:20 PM, Vadim Bichutskiy <
vadim.bichuts...@gmail.com> wrote:
Here's what I did:
print 'BROADCASTING...'
broadcastVar = sc.broadcast(mylist)
print broadcastVar
print broadcastVar.value
print 'FINISHED BROADCASTING...'
The above works fine,
but when I call myrdd.map(myfunc) I get *NameError: global name
'broadcastVar' is not defined*
The myfunc function is
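The NameError here is plain Python scoping rather than anything Spark-specific: a function resolves global names in the module where it was *defined*, so a broadcastVar created in main.py is invisible to code in process_stuff.py. A small pure-Python illustration, with the second module built inline just for the demo:

```python
import types

# Build a throwaway module, playing the role of process_stuff.py, whose
# function refers to a global name 'broadcastVar' never defined in it.
process_stuff = types.ModuleType("process_stuff")
exec("def myfunc(x):\n    return broadcastVar.value", process_stuff.__dict__)

broadcastVar = "defined here, in the caller's module"  # not visible to myfunc

try:
    process_stuff.myfunc(1)
except NameError as e:
    error_message = str(e)

print(error_message)  # name 'broadcastVar' is not defined
```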
Absolutely. The same code would work for local as well as distributed mode!
On Wed, Apr 22, 2015 at 11:08 AM, Vadim Bichutskiy <
vadim.bichuts...@gmail.com> wrote:
Can I use broadcast vars in local mode?
On Wed, Apr 22, 2015 at 2:06 PM, Tathagata Das wrote:
Yep. Not efficient. Pretty bad actually. That's why broadcast variables were
introduced right at the very beginning of Spark.
On Wed, Apr 22, 2015 at 10:58 AM, Vadim Bichutskiy <
vadim.bichuts...@gmail.com> wrote:
Thanks TD. I was looking into broadcast variables.
Right now I am running it locally...and I plan to move it to "production"
on EC2.
The way I fixed it is by doing myrdd.map(lambda x: (x, mylist)).map(myfunc)
but I don't think it's efficient?
mylist is filled only once at the start and never changes.
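A rough way to see why pairing every record with mylist is costly: if each (x, mylist) pair is serialized on its own when it crosses a process boundary, mylist is copied once per record, whereas a broadcast ships it once per executor. A pure-Python sketch with pickle (the sizes and list contents are illustrative assumptions; Spark batches and memoizes serialization, so real numbers differ, but the per-record copy is the concern):

```python
import pickle

mylist = list(range(10_000))   # stand-in for the metadata list
records = list(range(100))     # stand-ins for the RDD elements

# Pairing approach: mylist travels with every record it is attached to.
paired_bytes = sum(len(pickle.dumps((x, mylist))) for x in records)

# Broadcast-style: mylist crosses the wire once, records travel alone.
broadcast_bytes = len(pickle.dumps(mylist)) + sum(len(pickle.dumps(x)) for x in records)

print(paired_bytes > 10 * broadcast_bytes)  # True
```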
Is mylist present on every executor? If not, then you have to pass it
on, and broadcasts are the best way to pass it on. But note that once
broadcast, it will be immutable at the executors, and if you update the list
at the driver, you will have to broadcast it again.
TD
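The driver-side rebroadcast pattern described above, sketched with a stand-in class (FakeBroadcast is hypothetical, not Spark's API; in pyspark the real calls are sc.broadcast(...) and Broadcast.unpersist()):

```python
# FakeBroadcast is a hypothetical stand-in so the pattern runs without Spark.
class FakeBroadcast:
    def __init__(self, value):
        self.value = value
    def unpersist(self):
        # in pyspark, Broadcast.unpersist() drops the cached executor copies
        pass

def broadcast(value):          # plays the role of sc.broadcast(value)
    return FakeBroadcast(value)

mylist = [1, 2, 3]
broadcastVar = broadcast(mylist)

# Updating the driver-side list does not change what executors already
# cached; to ship the new value, unpersist and broadcast again.
mylist = mylist + [4]
broadcastVar.unpersist()
broadcastVar = broadcast(mylist)
print(broadcastVar.value)  # [1, 2, 3, 4]
```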
On Wed, Apr 22, 2015, Vadim Bichutskiy wrote:
I am using Spark Streaming with Python. For each RDD, I call a map, i.e.,
myrdd.map(myfunc); myfunc is in a separate Python module. In yet another
separate Python module I have a global list, i.e. mylist, that's populated
with metadata. I can't get myfunc to see mylist...it's always empty.
Alternat