Re: Broadcast Variables

2018-06-25 Thread mrsanketh
Issue: Not able to broadcast or place the files locally in the Spark worker nodes from Spark application in Cluster deploy mode.Spark job always throws FileNotFoundException. Issue Description: We are trying to access Kafka Cluster which is configured with SSL for encryption from Spark Streami

Broadcast variables: destroy/unpersist unexpected behaviour

2018-03-13 Thread Sunil
I experienced the below two cases when unpersisting or destroying broadcast variables in pyspark. But the same works good in spark scala shell. Any clue why this happens ? Is it a bug in pyspark? ***Case 1:*** >>> b1 = sc.broadcast([1,2,3]) >>> b1.value [1, 2, 3]

Re: access Broadcast Variables in Spark java

2016-12-20 Thread Richard Xin
try this:JavaRDD mapr = listrdd.map(x -> broadcastVar.value().get(x)); On Wednesday, December 21, 2016 2:25 PM, Sateesh Karuturi wrote: I need to process spark Broadcast variables using Java RDD API. This is my code what i have tried so far:This is only sample code to check whet

access Broadcast Variables in Spark java

2016-12-20 Thread Sateesh Karuturi
I need to process spark Broadcast variables using Java RDD API. This is my code what i have tried so far: This is only sample code to check whether its works or not? In my case i need to work on two csvfiles. SparkConf conf = new SparkConf().setAppName("BroadcastVariable").setMas

Re: Fatal error when using broadcast variables and checkpointing in Spark Streaming

2016-07-22 Thread Joe Panciera
global events num = events.value print num events.unpersist() events = sc.broadcast(num + 1) alert_stream.foreachRDD(test) # Comment this line and no error occurs ssc.checkpoint('dir') ssc.start() ssc.awaitTermination() On Fri, Jul 22, 2016 at 1:50 PM, Joe Panciera wrot

Fatal error when using broadcast variables and checkpointing in Spark Streaming

2016-07-22 Thread Joe Panciera
Hi, I'm attempting to use broadcast variables to update stateful values used across the cluster for processing. Essentially, I have a function that is executed in .foreachRDD that updates the broadcast variable by calling unpersist() and then rebroadcasting. This works without issues w

Re: Spark Streaming, Broadcast variables, java.lang.ClassCastException

2016-04-25 Thread mwol
I forgot the streamingContext.start() streamingContext.awaitTermination() in my example code, but the error stays the same... -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Streaming-Broadcast-variables-java-lang-ClassCastException-tp26828p26829

Spark Streaming, Broadcast variables, java.lang.ClassCastException

2016-04-25 Thread mwol
pache-spark-user-list.1001560.n3.nabble.com/Spark-Streaming-Broadcast-variables-java-lang-ClassCastException-tp26828.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubs

Broadcast Variables not showing inside Partitions Apache Spark

2015-11-08 Thread prajwol sangat
Hi All, I am facing a weird situation which is explained below. Scenario and Problem: I want to add two attributes to JSON object based on the look up table values and insert the JSON to Mongo DB. I have broadcast variable which holds look up table. However, i am not being able to access it ins

PySpark Checkpoints with Broadcast Variables

2015-09-29 Thread Jason White
ssc.awaitTermination() The error I inevitably get when restoring from the checkpoint is: Exception: (Exception("Broadcast variable '3' not loaded!",), , (3L,)) Has anyone had any luck checkpointing in PySpark with a broadcast variable? -- View this message in c

PySpark Checkpoints with Broadcast Variables

2015-09-29 Thread Jason White
I'm having trouble loading a streaming job from a checkpoint when a broadcast variable is defined. I've seen the solution by TD in Scala ( https://issues.apache.org/jira/browse/SPARK-5206) that uses a singleton to get/create an accumulator, but I can't seem to get it to work in PySpark with a broad

Re: Alternative to Large Broadcast Variables

2015-08-29 Thread Hemminger Jeff
Thanks for the recommendations. I had been focused on solving the problem "within Spark" but a distributed database sounds like a better solution. Jeff On Sat, Aug 29, 2015 at 11:47 PM, Ted Yu wrote: > Not sure if the race condition you mentioned is related to Cassandra's > data consistency mod

Re: Alternative to Large Broadcast Variables

2015-08-29 Thread Raghavendra Pandey
We are using Cassandra for similar kind of problem and it works well... You need to take care of race condition between updating the store and looking up the store... On Aug 29, 2015 1:31 AM, "Ted Yu" wrote: > +1 on Jason's suggestion. > > bq. this large variable is broadcast many times during th

Re: Alternative to Large Broadcast Variables

2015-08-28 Thread Ted Yu
+1 on Jason's suggestion. bq. this large variable is broadcast many times during the lifetime Please consider making this large variable more granular. Meaning, reduce the amount of data transferred between the key value store and your app during update. Cheers On Fri, Aug 28, 2015 at 12:44 PM,

Re: Alternative to Large Broadcast Variables

2015-08-28 Thread Jason
You could try using an external key value store (like HBase, Redis) and perform lookups/updates inside of your mappers (you'd need to create the connection within a mapPartitions code block to avoid the connection setup/teardown overhead)? I haven't done this myself though, so I'm just throwing th

Alternative to Large Broadcast Variables

2015-08-28 Thread Hemminger Jeff
Hi, I am working on a Spark application that is using of a large (~3G) broadcast variable as a lookup table. The application refines the data in this lookup table in an iterative manner. So this large variable is broadcast many times during the lifetime of the application process. >From what I ha

Re: SparkR broadcast variables

2015-08-03 Thread Deborah Siegel
se, it seems as if broadcast variables > ought to be working based on the tests. > > I have tried the following in sparkR shell, and similar code in RStudio, > but in both cases got the same message > > > randomMat <- matrix(nrow=10, ncol=10, data=rnorm(100)) > > rando

SparkR broadcast variables

2015-08-03 Thread Deborah Siegel
Hello, In looking at the SparkR codebase, it seems as if broadcast variables ought to be working based on the tests. I have tried the following in sparkR shell, and similar code in RStudio, but in both cases got the same message > randomMat <- matrix(nrow=10, ncol=10, data=rno

Re: Broadcast variables in R

2015-07-22 Thread FRANCHOIS Serge
ram On Tue, Jul 21, 2015 at 2:34 AM, Serge Franchois mailto:serge.franch...@altran.com>> wrote: I might add to this that I've done the same exercise on Linux (CentOS 6) and there, broadcast variables ARE working. Is this functionality perhaps not exposed on Mac OS X? Or has it to do

Re: Broadcast variables in R

2015-07-21 Thread Shivaram Venkataraman
n the JIRA with your use-case Thanks Shivaram On Tue, Jul 21, 2015 at 2:34 AM, Serge Franchois wrote: > I might add to this that I've done the same exercise on Linux (CentOS 6) > and > there, broadcast variables ARE working. Is this functionality perhaps not > exposed on Mac OS X

Re: Broadcast variables in R

2015-07-21 Thread Serge Franchois
I might add to this that I've done the same exercise on Linux (CentOS 6) and there, broadcast variables ARE working. Is this functionality perhaps not exposed on Mac OS X? Or has it to do with the fact there are no native Hadoop libs for Mac? -- View this message in context: http://a

Re: Broadcast variables in R

2015-07-20 Thread Eskilson,Aleksander
I've searched high and low to use broadcast variables in R. >Is is possible at all? I don't see them mentioned in the SparkR API. >Or is there another way of using this feature? > >I need to share a large amount of data between executors. >At the moment, I get warned about my

Broadcast variables in R

2015-07-20 Thread Serge Franchois
I've searched high and low to use broadcast variables in R. Is is possible at all? I don't see them mentioned in the SparkR API. Or is there another way of using this feature? I need to share a large amount of data between executors. At the moment, I get warned about my task being too

Re: Streaming: updating broadcast variables

2015-07-06 Thread Conor Fennell
Hi James, The code below shows one way how you can update the broadcast variable on the executors: // ... events stream setup var startTime = new Date().getTime() var hashMap = HashMap("1" -> ("1", 1), "2" -> ("2", 2)) var hashMapBroadcast = stream.context.sparkContext.broadcas

Re: Streaming: updating broadcast variables

2015-07-03 Thread Raghavendra Pandey
You cannot update the broadcasted variable.. It wont get reflected on workers. On Jul 3, 2015 12:18 PM, "James Cole" wrote: > Hi all, > > I'm filtering a DStream using a function. I need to be able to change this > function while the application is running (I'm polling a service to see if > a use

Streaming: updating broadcast variables

2015-07-02 Thread James Cole
Hi all, I'm filtering a DStream using a function. I need to be able to change this function while the application is running (I'm polling a service to see if a user has changed their filtering). The filter function is a transformation and runs on the workers, so that's where the updates need to go

Re: Broadcast variables can be rebroadcast?

2015-06-03 Thread NB
separately. Then just have all the code you have > operating on your RDDs look at the new broadcast variable. > > But I guess there is another way to look at it -- you are creating new > broadcast variables each time, but they all point to the same underlying > mutable data structure.

Re: Broadcast variables can be rebroadcast?

2015-05-19 Thread N B
ariable. Instead, you should create something new, and > just broadcast it separately. Then just have all the code you have > operating on your RDDs look at the new broadcast variable. > > But I guess there is another way to look at it -- you are creating new > broadcast variabl

Re: Broadcast variables can be rebroadcast?

2015-05-19 Thread Imran Rashid
e code you have operating on your RDDs look at the new broadcast variable. But I guess there is another way to look at it -- you are creating new broadcast variables each time, but they all point to the same underlying mutable data structure. So in a way, you are "rebroadcasting" the same underl

Re: Broadcast variables can be rebroadcast?

2015-05-19 Thread N B
ou need to update it > } > > On Sat, May 16, 2015 at 2:01 AM, N B wrote: > >> Thanks Ayan. Can we rebroadcast after updating in the driver? >> >> Thanks >> NB. >> >> >> On Fri, May 15, 2015 at 6:40 PM, ayan guha wrote: >> >>> Hi >

Re: Broadcast variables can be rebroadcast?

2015-05-18 Thread Imran Rashid
here, with whatever you need to update it } On Sat, May 16, 2015 at 2:01 AM, N B wrote: > Thanks Ayan. Can we rebroadcast after updating in the driver? > > Thanks > NB. > > > On Fri, May 15, 2015 at 6:40 PM, ayan guha wrote: > >> Hi >> >> broadcast va

Re: Broadcast variables can be rebroadcast?

2015-05-16 Thread N B
Thanks Ayan. Can we rebroadcast after updating in the driver? Thanks NB. On Fri, May 15, 2015 at 6:40 PM, ayan guha wrote: > Hi > > broadcast variables are shipped for the first time it is accessed in a > transformation to the executors used by the transformation. It will N

Re: Broadcast variables can be rebroadcast?

2015-05-15 Thread ayan guha
Hi broadcast variables are shipped for the first time it is accessed in a transformation to the executors used by the transformation. It will NOT updated subsequently, even if the value has changed. However, a new value will be shipped to any new executor comes into play after the value has

Re: Broadcast variables can be rebroadcast?

2015-05-15 Thread Ilya Ganelin
nce a broadcast variable is created using sparkContext.broadcast(), can >>> it >>> ever be updated again? The use case is for something like the underlying >>> lookup data changing over time. >>> >>> Thanks >>> NB >>> >>> >&g

Re: Broadcast variables can be rebroadcast?

2015-05-15 Thread N B
derlying >> lookup data changing over time. >> >> Thanks >> NB >> >> >> >> >> -- >> View this message in context: >> http://apache-spark-user-list.1001560.n3.nabble.com/Broadcast-variable

Re: Broadcast variables can be rebroadcast?

2015-05-15 Thread Ilya Ganelin
ain? The use case is for something like the underlying > lookup data changing over time. > > Thanks > NB > > > > > -- > View this message in context: > http://apache-spark-user-list.1001560.n3.nabble.com/Broadcast-variables-can-be-rebroadcast-tp22908.html > Sent from

Broadcast variables can be rebroadcast?

2015-05-15 Thread NB
/Broadcast-variables-can-be-rebroadcast-tp22908.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h

Re: Meet Exception when learning Broadcast Variables

2015-04-21 Thread Jeetendra Gangele
5, at 3:19 AM, donhoff_h <165612...@qq.com> wrote: > > > > Hi, experts. > > > > I wrote a very little program to learn how to use Broadcast Variables, > but met an exception. The program and the exception are listed as > following. Could anyone help me to solve

Re: Meet Exception when learning Broadcast Variables

2015-04-21 Thread Ted Yu
Does line 27 correspond to brdcst.value ? Cheers > On Apr 21, 2015, at 3:19 AM, donhoff_h <165612...@qq.com> wrote: > > Hi, experts. > > I wrote a very little program to learn how to use Broadcast Variables, but > met an exception. The program and the exception

Meet Exception when learning Broadcast Variables

2015-04-21 Thread donhoff_h
Hi, experts. I wrote a very little program to learn how to use Broadcast Variables, but met an exception. The program and the exception are listed as following. Could anyone help me to solve this problem? Thanks! **My Program is as following** object TestBroadcast02

Re: Which strategy is used for broadcast variables?

2015-03-11 Thread Tom Hubregtsen
gt;> Spark currently uses a BitTorrent like mechanism that's been tuned for >>> datacenter environments. >>> >>> Mosharaf >>> -- >>> From: Tom >>> Sent: ‎3/‎11/‎2015 4:58 PM >>> To: user@spark.ap

Re: Which strategy is used for broadcast variables?

2015-03-11 Thread Mosharaf Chowdhury
utdated document from 4/5 years ago. >> >> Spark currently uses a BitTorrent like mechanism that's been tuned for >> datacenter environments. >> >> Mosharaf >> -- >> From: Tom >> Sent: ‎3/‎11/‎2015 4:58 PM >> To

Re: Which strategy is used for broadcast variables?

2015-03-11 Thread Tom Hubregtsen
> Sent: ‎3/‎11/‎2015 4:58 PM > To: user@spark.apache.org > Subject: Which strategy is used for broadcast variables? > > In "Performance and Scalability of Broadcast in Spark" by Mosharaf > Chowdhury > I read that Spark uses HDFS for its broadcast variables. This seems h

RE: Which strategy is used for broadcast variables?

2015-03-11 Thread Mosharaf Chowdhury
ubject: Which strategy is used for broadcast variables? In "Performance and Scalability of Broadcast in Spark" by Mosharaf Chowdhury I read that Spark uses HDFS for its broadcast variables. This seems highly inefficient. In the same paper alternatives are proposed, among which "Bittore

Which strategy is used for broadcast variables?

2015-03-11 Thread Tom
In "Performance and Scalability of Broadcast in Spark" by Mosharaf Chowdhury I read that Spark uses HDFS for its broadcast variables. This seems highly inefficient. In the same paper alternatives are proposed, among which "Bittorent Broadcast (BTB)". While studying "

Nullpointer Exception on broadcast variables (YARN Cluster mode)

2015-03-05 Thread samriddhac
Thanks and Regards Samriddha -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Nullpointer-Exception-on-broadcast-variables-YARN-Cluster-mode-tp21929.html Sent from the Apache Spark User List mailing list a

Re: pyspark: Java null pointer exception when accessing broadcast variables

2015-02-13 Thread Davies Liu
ut it's still not quite there >> -- see >> http://apache-spark-user-list.1001560.n3.nabble.com/iteratively-modifying-an-RDD-td21606.html >> >> Thanks, >> >> Rok >> >> >> On Wed, Feb 11, 2015, 19:59 Davies Liu wrote: >>> >>

Re: pyspark: Java null pointer exception when accessing broadcast variables

2015-02-12 Thread Rok Roskar
adcast >> is a >> > bad idea but I tried something similar with a much smaller dictionary >> and >> > encountered the same problem. I'm not familiar enough with spark >> internals >> > to know whether the trace indicates an issue with the broadcast >>

Re: pyspark: Java null pointer exception when accessing broadcast variables

2015-02-11 Thread Rok Roskar
ilar with a much smaller dictionary and > > encountered the same problem. I'm not familiar enough with spark > internals > > to know whether the trace indicates an issue with the broadcast > variables or > > perhaps something different? > > > > The driver and

Re: pyspark: Java null pointer exception when accessing broadcast variables

2015-02-11 Thread Davies Liu
e same problem. I'm not familiar enough with spark internals > to know whether the trace indicates an issue with the broadcast variables or > perhaps something different? > > The driver and executors have 50gb of ram so memory should be fine. > > Thanks, > > Rok > &

Re: pyspark: Java null pointer exception when accessing broadcast variables

2015-02-10 Thread Rok Roskar
I didn't notice other errors -- I also thought such a large broadcast is a bad idea but I tried something similar with a much smaller dictionary and encountered the same problem. I'm not familiar enough with spark internals to know whether the trace indicates an issue with the broadcast

Re: pyspark: Java null pointer exception when accessing broadcast variables

2015-02-10 Thread Davies Liu
5, at 10:01 PM, Davies Liu wrote: > >> Could you paste the NPE stack trace here? It will better to create a >> JIRA for it, thanks! >> >> On Tue, Feb 10, 2015 at 10:42 AM, rok wrote: >>> I'm trying to use a broadcasted dictionary inside a map function

Re: pyspark: Java null pointer exception when accessing broadcast variables

2015-02-10 Thread Rok Roskar
park cluster. I seem to recall being able >> to do this before but at the moment I am at a loss as to what to try next. >> Is there a limit to the size of broadcast variables? This one is rather >> large (a few Gb dict). Thanks! >> >> Rok >> >> >>

Re: pyspark: Java null pointer exception when accessing broadcast variables

2015-02-10 Thread Davies Liu
an IPython > session connected to a standalone spark cluster. I seem to recall being able > to do this before but at the moment I am at a loss as to what to try next. > Is there a limit to the size of broadcast variables? This one is rather > large (a few Gb dict). Thanks! > > Rok >

pyspark: Java null pointer exception when accessing broadcast variables

2015-02-10 Thread rok
o try next. Is there a limit to the size of broadcast variables? This one is rather large (a few Gb dict). Thanks! Rok -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/pyspark-Java-null-pointer-exception-when-accessing-broadcast-variables-tp21580.html Sent fro

R: Broadcast variables: when should I use them?

2015-01-26 Thread Paolo Platter
:34 A: user@spark.apache.org<mailto:user@spark.apache.org> Oggetto: Broadcast variables: when should I use them? Hello. I have a number of "static" Arrays and Maps in my Spark Streaming driver program. They are simple collections, initialized with integer values and strings directly in the co

Broadcast variables: when should I use them?

2015-01-26 Thread frodo777
each. They are used in several subsequent parallel operations. The question is: Should I convert them into broadcast variables? Thanks and regards. -Bob -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Broadcast-variables-when-should-I-use-them-tp21366.html

RDD lineage and broadcast variables

2014-12-12 Thread Ron Ayoub
I'm still wrapping my head around that fact that the data backing an RDD is immutable since an RDD may need to be reconstructed from its lineage at any point. In the context of clustering there are many iterations where an RDD may need to change (for instance cluster assignments, etc) based on a

Iterative changes to RDD and broadcast variables

2014-11-16 Thread Shannon Quinn
Hi all, I'm iterating over an RDD (representing a distributed matrix...have to roll my own in Python) and making changes to different submatrices at each iteration. The loop structure looks something like: for i in range(x): VAR = sc.broadcast(i) rdd.map(func1).reduceByKey(func2) M = rdd.

Re: Broadcast Variables

2014-05-27 Thread Puneet Lakhina
uneet Lakhina wrote: > Hi, > > Im confused on what is the right way to use broadcast variables from java. > > My code looks something like this: > > Map<> val = //build Map to be broadcast > Broadcast> broadastVar = sc.broadcast(val); > > > sc.textFile(...).

Broadcast Variables

2014-05-22 Thread Puneet Lakhina
Hi, Im confused on what is the right way to use broadcast variables from java. My code looks something like this: Map<> val = //build Map to be broadcast Broadcast> broadastVar = sc.broadcast(val); sc.textFile(...).map(new SomeFunction()) { //Do something here using broadcast

Re: when to use broadcast variables

2014-05-03 Thread Patrick Wendell
Broadcast variables need to fit entirely in memory - so that's a pretty good litmus test for whether or not to broadcast a smaller dataset or turn it into an RDD. On Fri, May 2, 2014 at 7:50 AM, Prashant Sharma wrote: > I had like to be corrected on this but I am just trying to say smal

Re: when to use broadcast variables

2014-05-02 Thread Prashant Sharma
I had like to be corrected on this but I am just trying to say small enough of the order of few 100 MBs. Imagine the size gets shipped to all nodes, it can be a GB but not GBs and then depends on the network too. Prashant Sharma On Fri, May 2, 2014 at 6:42 PM, Diana Carroll wrote: > Anyone hav

when to use broadcast variables

2014-05-02 Thread Diana Carroll
Anyone have any guidance on using a broadcast variable to ship data to workers vs. an RDD? Like, say I'm joining web logs in an RDD with user account data. I could keep the account data in an RDD or if it's "small", a broadcast variable instead. How small is small? Small enough that I know it c