Issue:
Not able to broadcast or place the files locally on the Spark worker nodes
from a Spark application in cluster deploy mode. The Spark job always throws a
FileNotFoundException.
Issue Description:
We are trying to access a Kafka cluster which is configured with SSL for
encryption from a Spark Streaming application.
I experienced the below two cases when unpersisting or destroying broadcast
variables in PySpark, but the same works fine in the Spark Scala shell. Any
clue why this happens? Is it a bug in PySpark?
***Case 1:***
>>> b1 = sc.broadcast([1,2,3])
>>> b1.value
[1, 2, 3]
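For readers hitting the same question: the difference being probed above is that unpersist() only drops the materialized copies on the executors (the variable can be re-shipped on next use), while destroy() releases all state, after which any .value access fails. A plain-Python sketch of that contract (no live Spark; MockBroadcast is a hypothetical stand-in for pyspark's Broadcast, not its implementation):

```python
# Plain-Python sketch of the Broadcast contract probed in Case 1 above.
# MockBroadcast is a hypothetical stand-in, not pyspark's implementation.
class MockBroadcast:
    def __init__(self, value):
        self._value = value
        self._destroyed = False

    @property
    def value(self):
        if self._destroyed:
            raise RuntimeError("Attempted to use a destroyed broadcast variable")
        return self._value

    def unpersist(self):
        # Drops only the executor-side copies; the variable stays usable
        # and would be re-shipped on next access in real Spark.
        pass

    def destroy(self):
        # Releases everything; any further use is an error.
        self._destroyed = True

b1 = MockBroadcast([1, 2, 3])
b1.unpersist()
after_unpersist = b1.value   # still fine after unpersist
b1.destroy()
# b1.value would now raise RuntimeError
```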
Try this:
JavaRDD mapr = listrdd.map(x -> broadcastVar.value().get(x));
On Wednesday, December 21, 2016 2:25 PM, Sateesh Karuturi wrote:
I need to process Spark broadcast variables using the Java RDD API. This is my
code, what I have tried so far. This is only sample code to check whether it
works or not; in my case I need to work on two CSV files.
SparkConf conf = new SparkConf().setAppName("BroadcastVariable").setMas
def test(rdd):
    global events
    num = events.value
    print num
    events.unpersist()
    events = sc.broadcast(num + 1)

alert_stream.foreachRDD(test)
# Comment this line and no error occurs
ssc.checkpoint('dir')
ssc.start()
ssc.awaitTermination()
On Fri, Jul 22, 2016 at 1:50 PM, Joe Panciera wrote:
Hi,
I'm attempting to use broadcast variables to update stateful values used
across the cluster for processing. Essentially, I have a function that is
executed in .foreachRDD that updates the broadcast variable by calling
unpersist() and then rebroadcasting. This works without issues w
I forgot the
streamingContext.start()
streamingContext.awaitTermination()
in my example code, but the error stays the same...
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Streaming-Broadcast-variables-java-lang-ClassCastException-tp26828p26829
Sent from the Apache Spark User List mailing list archive at Nabble.com.
Hi All,
I am facing a weird situation which is explained below.
Scenario and problem: I want to add two attributes to a JSON object based on
lookup table values and insert the JSON into MongoDB. I have a broadcast
variable which holds the lookup table. However, I am not able to access it
inside
ssc.awaitTermination()
The error I inevitably get when restoring from the checkpoint is:
Exception: (Exception("Broadcast variable '3' not loaded!",), , (3L,))
Has anyone had any luck checkpointing in PySpark with a broadcast variable?
I'm having trouble loading a streaming job from a checkpoint when a
broadcast variable is defined. I've seen the solution by TD in Scala
(https://issues.apache.org/jira/browse/SPARK-5206) that uses a singleton to
get/create an accumulator, but I can't seem to get it to work in PySpark
with a broadcast variable.
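The workaround referenced above (SPARK-5206) boils down to a lazily initialized singleton: the broadcast handle is never stored in the checkpoint, and each process re-creates it on first use after recovery. A minimal plain-Python sketch of that get-or-create pattern, assuming no live Spark (fake_broadcast is a hypothetical stand-in for sc.broadcast):

```python
# Get-or-create singleton sketch (the pattern from SPARK-5206, plain Python).
# The handle lives in module state instead of the checkpoint, so it can be
# rebuilt after a restart rather than deserialized from the checkpoint.
_cache = {}

def get_or_create(broadcast_fn, data):
    # broadcast_fn stands in for sc.broadcast on a real SparkContext.
    if "bc" not in _cache:
        _cache["bc"] = broadcast_fn(data)
    return _cache["bc"]

calls = []
def fake_broadcast(data):
    calls.append(data)        # count how often we actually "broadcast"
    return {"value": data}    # stand-in for a Broadcast handle

b1 = get_or_create(fake_broadcast, [1, 2, 3])
b2 = get_or_create(fake_broadcast, [1, 2, 3])
# the broadcast is created exactly once; later calls reuse the cached handle
```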
Thanks for the recommendations. I had been focused on solving the problem
"within Spark" but a distributed database sounds like a better solution.
Jeff
On Sat, Aug 29, 2015 at 11:47 PM, Ted Yu wrote:
> Not sure if the race condition you mentioned is related to Cassandra's
> data consistency model
We are using Cassandra for similar kind of problem and it works well... You
need to take care of race condition between updating the store and looking
up the store...
On Aug 29, 2015 1:31 AM, "Ted Yu" wrote:
+1 on Jason's suggestion.
bq. this large variable is broadcast many times during the lifetime
Please consider making this large variable more granular. Meaning, reduce
the amount of data transferred between the key value store and your app
during update.
Cheers
On Fri, Aug 28, 2015 at 12:44 PM,
You could try using an external key-value store (like HBase or Redis) and
perform lookups/updates inside your mappers (you'd need to create the
connection within a mapPartitions code block to avoid the connection
setup/teardown overhead)?
I haven't done this myself though, so I'm just throwing this out there.
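The per-partition connection pattern suggested above can be sketched in plain Python. FakeKVStore is a hypothetical stand-in for an HBase/Redis client; in real code the generator below would be passed to rdd.mapPartitions:

```python
# Sketch of the mapPartitions lookup pattern: open one connection per
# partition, not per record. FakeKVStore is a hypothetical client class.
class FakeKVStore:
    opened = 0                      # track connection setups for illustration

    def __init__(self):
        FakeKVStore.opened += 1
        self._data = {"a": 1, "b": 2}

    def get(self, key):
        return self._data.get(key)

    def close(self):
        pass

def lookup_partition(records):
    conn = FakeKVStore()            # setup once per partition
    try:
        for r in records:
            yield (r, conn.get(r))  # per-record lookup reuses the connection
    finally:
        conn.close()                # teardown once per partition

# In Spark this would be: enriched = rdd.mapPartitions(lookup_partition)
# Here we apply it to two in-memory "partitions" to show the pattern:
partitions = [["a", "b"], ["b", "c"]]
enriched = [list(lookup_partition(p)) for p in partitions]
```

Note the try/finally: the connection is closed even if a record fails, and only one connection is opened per partition regardless of record count.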
Hi,
I am working on a Spark application that is using a large (~3G)
broadcast variable as a lookup table. The application refines the data in
this lookup table in an iterative manner, so this large variable is
broadcast many times during the lifetime of the application process.
From what I ha
Hello,
In looking at the SparkR codebase, it seems as if broadcast variables ought
to be working based on the tests.
I have tried the following in sparkR shell, and similar code in RStudio,
but in both cases got the same message
> randomMat <- matrix(nrow=10, ncol=10, data=rnorm(100))
Shivaram
n the
JIRA with your use-case
Thanks
Shivaram
On Tue, Jul 21, 2015 at 2:34 AM, Serge Franchois wrote:
I might add to this that I've done the same exercise on Linux (CentOS 6) and
there, broadcast variables ARE working. Is this functionality perhaps not
exposed on Mac OS X? Or has it to do with the fact there are no native
Hadoop libs for Mac?
I've searched high and low to use broadcast variables in R.
Is it possible at all? I don't see them mentioned in the SparkR API.
Or is there another way of using this feature?
I need to share a large amount of data between executors.
At the moment, I get warned about my task being too large.
Hi James,
The code below shows one way you can update the broadcast variable on
the executors:
// ... events stream setup
var startTime = new Date().getTime()
var hashMap = HashMap("1" -> ("1", 1), "2" -> ("2", 2))
var hashMapBroadcast = stream.context.sparkContext.broadcast(hashMap)
You cannot update a broadcast variable; the change won't be reflected on the
workers.
On Jul 3, 2015 12:18 PM, "James Cole" wrote:
Hi all,
I'm filtering a DStream using a function. I need to be able to change this
function while the application is running (I'm polling a service to see if
a user has changed their filtering). The filter function is a
transformation and runs on the workers, so that's where the updates need to
go
...variable. Instead, you should create something new, and just broadcast it
separately. Then just have all the code you have operating on your RDDs look
at the new broadcast variable.
But I guess there is another way to look at it -- you are creating new
broadcast variables each time, but they all point to the same underlying
mutable data structure. So in a way, you are "rebroadcasting" the same
underlying data.
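The "create something new and re-point your code at it" advice can be sketched in plain Python (no live Spark; FakeBroadcast is a hypothetical stand-in for Spark's Broadcast handle, and refresh stands in for the driver-side update step):

```python
# Sketch of the re-broadcast pattern: unpersist the old handle on the
# driver, then broadcast a fresh snapshot and hand out the new handle.
class FakeBroadcast:
    def __init__(self, value):
        self.value = value
        self.persisted = True

    def unpersist(self):
        # In Spark this drops the stale copies cached on the executors.
        self.persisted = False

def refresh(old_bc, new_value):
    old_bc.unpersist()                 # retire the old variable
    return FakeBroadcast(new_value)    # sc.broadcast(new_value) in real code

lookup_bc = FakeBroadcast({"threshold": 5})
old = lookup_bc
lookup_bc = refresh(lookup_bc, {"threshold": 7})
# downstream code reads lookup_bc.value and sees the new snapshot
```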
here, with
whatever you need to update it
}
On Sat, May 16, 2015 at 2:01 AM, N B wrote:
Thanks Ayan. Can we rebroadcast after updating in the driver?
Thanks
NB.
On Fri, May 15, 2015 at 6:40 PM, ayan guha wrote:
Hi
Broadcast variables are shipped the first time they are accessed in a
transformation, to the executors used by that transformation. They will NOT be
updated subsequently, even if the value has changed. However, a new value
will be shipped to any new executor that comes into play after the value has
changed.
Once a broadcast variable is created using sparkContext.broadcast(), can it
ever be updated again? The use case is for something like the underlying
lookup data changing over time.
Thanks
NB
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/Broadcast-variables-can-be-rebroadcast-tp22908.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
Does line 27 correspond to brdcst.value?
Cheers
> On Apr 21, 2015, at 3:19 AM, donhoff_h <165612...@qq.com> wrote:
Hi, experts.
I wrote a very little program to learn how to use Broadcast Variables, but met
an exception. The program and the exception are listed as following. Could
anyone help me to solve this problem? Thanks!
**My program is as follows**
object TestBroadcast02
That's an outdated document from 4/5 years ago.
Spark currently uses a BitTorrent-like mechanism that's been tuned for
datacenter environments.
Mosharaf
--
From: Tom
Sent: 3/11/2015 4:58 PM
To: user@spark.apache.org
Subject: Which strategy is used for broadcast variables?
In "Performance and Scalability of Broadcast in Spark" by Mosharaf Chowdhury
I read that Spark uses HDFS for its broadcast variables. This seems highly
inefficient. In the same paper alternatives are proposed, among which
"Bittorrent Broadcast (BTB)". While studying "
Thanks and Regards
Samriddha
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/Nullpointer-Exception-on-broadcast-variables-YARN-Cluster-mode-tp21929.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
...but it's still not quite there
>> -- see
>> http://apache-spark-user-list.1001560.n3.nabble.com/iteratively-modifying-an-RDD-td21606.html
>>
>> Thanks,
>>
>> Rok
>>
>>
>> On Wed, Feb 11, 2015, 19:59 Davies Liu wrote:
I didn't notice other errors -- I also thought such a large broadcast is a
bad idea, but I tried something similar with a much smaller dictionary and
encountered the same problem. I'm not familiar enough with Spark internals
to know whether the trace indicates an issue with the broadcast variables or
perhaps something different?
The driver and executors have 50gb of RAM, so memory should be fine.
Thanks,
Rok
...2015, at 10:01 PM, Davies Liu wrote:
>
>> Could you paste the NPE stack trace here? It would be better to create a
>> JIRA for it, thanks!
>>
>> On Tue, Feb 10, 2015 at 10:42 AM, rok wrote:
>>> I'm trying to use a broadcasted dictionary inside a map function
I'm trying to use a broadcasted dictionary inside a map function from an
IPython session connected to a standalone Spark cluster. I seem to recall
being able to do this before, but at the moment I am at a loss as to what to
try next. Is there a limit to the size of broadcast variables? This one is
rather large (a few Gb dict). Thanks!
Rok
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/pyspark-Java-null-pointer-exception-when-accessing-broadcast-variables-tp21580.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
To: user@spark.apache.org
Subject: Broadcast variables: when should I use them?
Hello.
I have a number of "static" Arrays and Maps in my Spark Streaming driver
program. They are simple collections, initialized with integer values and
strings directly in the code, ... each.
They are used in several subsequent parallel operations.
The question is:
Should I convert them into broadcast variables?
Thanks and regards.
-Bob
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/Broadcast-variables-when-should-I-use-them-tp21366.html
I'm still wrapping my head around the fact that the data backing an RDD is
immutable, since an RDD may need to be reconstructed from its lineage at any
point. In the context of clustering there are many iterations where an RDD may
need to change (for instance cluster assignments, etc.) based on a
Hi all,
I'm iterating over an RDD (representing a distributed matrix...have to
roll my own in Python) and making changes to different submatrices at
each iteration. The loop structure looks something like:
for i in range(x):
    VAR = sc.broadcast(i)
    rdd.map(func1).reduceByKey(func2)
    M = rdd.
Hi,
I'm confused about the right way to use broadcast variables from Java.
My code looks something like this:
Map<...> val = ... // build Map to be broadcast
Broadcast<Map<...>> broadcastVar = sc.broadcast(val);
sc.textFile(...).map(new SomeFunction()) {
// Do something here using broadcast
Broadcast variables need to fit entirely in memory - so that's a
pretty good litmus test for whether or not to broadcast a smaller
dataset or turn it into an RDD.
On Fri, May 2, 2014 at 7:50 AM, Prashant Sharma wrote:
I'd like to be corrected on this, but I am just trying to say small enough,
of the order of a few 100 MBs. Imagine the size gets shipped to all nodes; it
can be a GB but not GBs, and then it depends on the network too.
Prashant Sharma
On Fri, May 2, 2014 at 6:42 PM, Diana Carroll wrote:
Anyone have any guidance on using a broadcast variable to ship data to
workers vs. an RDD?
Like, say I'm joining web logs in an RDD with user account data. I could
keep the account data in an RDD or, if it's "small", in a broadcast variable
instead. How small is small? Small enough that I know it can fit in memory?
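What keeping the small table in a broadcast variable buys is a map-side join: every worker gets a full copy of the small table and joins locally, with no shuffle. A plain-Python sketch of the idea (no live Spark; the table contents and names are illustrative):

```python
# Map-side ("broadcast") join sketch: the small account table is copied to
# every task as a dict; each log record is enriched by a local lookup,
# avoiding the shuffle an RDD-to-RDD join would need.
accounts = {"u1": "alice", "u2": "bob"}                        # small side: broadcast it
web_logs = [("u1", "/home"), ("u2", "/cart"), ("u1", "/buy")]  # large side: RDD

def enrich(record, lookup):
    # In Spark: logs_rdd.map(lambda r: enrich(r, accounts_bc.value))
    user, path = record
    return (user, lookup.get(user), path)

joined = [enrich(r, accounts) for r in web_logs]
```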