I'd like to know whether the broadcast object gets serialized again when it is accessed by the
executor during the execution of a task.
I know that it gets serialized once when sent from the driver to the worker. My question is about
what happens inside the worker when the executor JVMs access it.
thanks
Jeff
could someone please comment on this? thanks
From: jeffsar...@hotmail.com
To: user@spark.apache.org
Subject: Access to broadcasted variable
Date: Thu, 18 Feb 2016 14:44:07 -0500
I'd like to know if the broadcasted object gets serialized when accessed by the
executor during the execution of a task?
Subject: RE: Access to broadcasted variable
From: shixi...@databricks.com
To: jeffsar...@hotmail.com
CC: user@spark.apache.org
The broadcast object is serialized on the driver and sent to the executors. In the executor,
the bytes are deserialized to get the broadcast object back.
On Fri, Feb 19, 2016 at 5:54 AM, jeff saremi
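For illustration, a minimal sketch of that lifecycle; the app name, lookup map and data are placeholders, not from the thread:
```
import org.apache.spark.{SparkConf, SparkContext}

// a minimal sketch, not from the thread; the app name and lookup map are placeholders
val sc = new SparkContext(new SparkConf().setAppName("broadcast-sketch"))

val lookup = Map("a" -> 1, "b" -> 2)
val bcLookup = sc.broadcast(lookup)                  // serialized once on the driver

val counts = sc.parallelize(Seq("a", "b", "a"))
  .map(word => bcLookup.value.getOrElse(word, 0))    // first .value call on an executor
  .collect()                                         // deserializes the bytes; later tasks reuse the object
```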
Is there any way to let Spark know ahead of time what size of RDD to expect as the
result of a flatMap() operation?
And would that help in terms of performance?
For instance, if I have an RDD of 1 million rows and I know that my flatMap()
will produce 100 million rows, is there a way to indicate that?
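As far as I know there is no such size hint on flatMap itself; a minimal sketch of a common workaround (placeholder numbers, assuming an existing SparkContext sc) is to adjust the partition count around the expansion:
```
// a minimal sketch with placeholder numbers (not from the thread): repartitioning
// around the flatMap keeps the ~100x larger output spread over enough tasks
val inputRdd = sc.parallelize(1 to 1000000)      // stand-in for the 1M-row RDD
val expanded = inputRdd
  .repartition(1000)                             // spread the expansion over more tasks
  .flatMap(i => Iterator.fill(100)(i))           // each row expands to ~100 rows
```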
There are executor logs and driver logs. Most of them are not intuitive enough
to mean anything to us.
Are there any notes, documents, or talks on how to decipher these logs and use them to
troubleshoot our applications' performance?
thanks
Jeff
So that it is available even in offline mode? I can't seem to find any notes on that.
thanks
jeff
CC: user@spark.apache.org
To: jeffsar...@hotmail.com
Are you talking about a package which is listed on http://spark-packages.org ?
The package should come with installation instructions, right?
On Oct 4, 2015, at 8:55 PM, jeff saremi wrote:
So that it is available even in offline mode? I can't seem to find any notes on that.
So we tried reading a SequenceFile in Spark and realized that all our records
have ended up becoming the same.
Then one of us found this:
Note: Because Hadoop's RecordReader class re-uses the same Writable object for
each record, directly caching the returned RDD or directly passing it to an
aggregation or shuffle operation will create many references to the same object.
If you plan to directly cache, sort, or aggregate Hadoop writable objects, you
should first copy them using a map function.
… instead of creating Java objects. As you've pointed out, this is
at the expense of making the code more verbose when caching.
-Sandy
On Fri, Nov 13, 2015 at 10:29 AM, jeff saremi wrote:
So we tried reading a sequencefile in Spark and realized that all our records
have ended up becoming the same
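A minimal sketch of the usual fix, assuming an existing SparkContext sc, LongWritable/Text key and value types and a placeholder path: copy each record into plain JVM objects before caching:
```
import org.apache.hadoop.io.{LongWritable, Text}

// a minimal sketch, assuming LongWritable keys, Text values and a placeholder path;
// mapping to plain JVM types copies the data out of the reused Writable instances
val rdd = sc.sequenceFile("hdfs:///path/to/seqfile", classOf[LongWritable], classOf[Text])
  .map { case (k, v) => (k.get, v.toString) }
  .cache()
```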
I've tried desperately to create an RDD from a matrix I have. Every combination
failed.
I have a sparse matrix returned from a call to
dv = DictVectorizer()
sv_tf = dv.fit_transform(tf)
which is supposed to be a matrix of document terms and their frequencies.
I need to convert this to an
I'd like to know how -- from within Java/Spark -- I can access the dependent
files which I deploy using the "--files" option on the command line.
I wish someone added this to the documentation.
From: jeff saremi
Sent: Thursday, January 19, 2017 9:56 AM
To: Sidney Feiner
Cc: user@spark.apache.org
Subject: Re: Spark-submit: where do --files go?
Thanks Sidney
From: Sidney Feiner
Sent: Thursday, January 19, 2017 9:52 AM
To: jeff saremi
Cc: user@spark.apache.org
Subject: Re: Spark-submit: where do --files go?
Every executor creates a directory with your submitted files, and you can access
every file in it by name.
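A minimal sketch of reading such a file from job code; the file name is a placeholder for whatever was passed to --files:
```
import org.apache.spark.SparkFiles

// a minimal sketch, assuming the job was submitted with --files lookup.txt;
// SparkFiles.get resolves the local path of the copy in the container's work directory
val localPath = SparkFiles.get("lookup.txt")
val lines = scala.io.Source.fromFile(localPath).getLines().toList
```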
I have a function which does regex matching in Scala. When I test it in the
REPL I get the expected results.
When I use it as a UDF in Spark SQL I get completely incorrect results.
Function:
class UrlFilter (filters: Seq[String]) extends Serializable {
val regexFilters = filters.map(new Regex(_))
r
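For context, a self-contained sketch of how such a filter is typically wired up as a Spark SQL UDF; the class, regex, table and UDF names here are assumed, not the thread's actual code:
```
import scala.util.matching.Regex

// a minimal, self-contained sketch with assumed names (not the thread's actual code):
// a Serializable regex filter used as a Spark SQL UDF
class UrlFilterSketch(filters: Seq[String]) extends Serializable {
  private val regexFilters: Seq[Regex] = filters.map(new Regex(_))
  def matches(url: String): Boolean = regexFilters.exists(_.findFirstIn(url).isDefined)
}

val urlFilter = new UrlFilterSketch(Seq("""^https?://example\.com/.*"""))
spark.udf.register("matchesFilter", (url: String) => urlFilter.matches(url))   // spark is an assumed SparkSession
spark.sql("SELECT url FROM urls WHERE matchesFilter(url)")                      // urls is an assumed table
```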
Never mind!
My data had a space at the end which was not showing up in manual testing.
thanks
From: jeff saremi
Sent: Tuesday, June 20, 2017 2:48:06 PM
To: user@spark.apache.org
Subject: Bizarre diff in behavior between Scala REPL and Spark SQL UDF
I have
You can do a map() using a select and functions/UDFs. But how do you process a
partition using SQL?
… approach in SQL.
From: Ryan
Sent: Sunday, June 25, 2017 7:18:32 PM
To: jeff saremi
Cc: user@spark.apache.org
Subject: Re: What is the equivalent of mapPartitions in SparkSQL?
Why would you like to do so? I think there's no need for us to explicitly ask
for a forEachPartition in spark sql because tu
… as such, forcing us to stay conservative and just make do without SQL. I'm sure
we're not alone here.
From: Aaron Perrin
Sent: Tuesday, June 27, 2017 4:50:25 PM
To: Ryan; jeff saremi
Cc: user@spark.apache.org
Subject: Re: What is the equivalent of mapPartitions in SparkSQL?
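A minimal sketch, not from the thread, of the closest typed-API equivalent, Dataset.mapPartitions:
```
import org.apache.spark.sql.SparkSession

// a minimal sketch, not from the thread: the typed Dataset API has mapPartitions,
// the closest equivalent to RDD.mapPartitions above the RDD level
val spark = SparkSession.builder().appName("mapPartitions-sketch").getOrCreate()
import spark.implicits._

val ds = spark.range(0, 1000).as[Long]
val perPartitionSums = ds.mapPartitions { rows =>
  // the whole partition arrives as one iterator, e.g. to share one connection per partition
  Iterator.single(rows.sum)
}
perPartitionSums.show()
```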
I tried this query in 1.6 and it failed:
SELECT * FROM Table1 EXCEPT ALL SELECT * FROM Table2
Exception in thread "main" java.lang.RuntimeException: [1.32] failure: ``(''
expected but `all' found
thanks
Jeff
EXCEPT is not the same as EXCEPT ALL.
Had they implemented EXCEPT ALL in SparkSQL, one could have easily obtained
EXCEPT by adding a distinct() to the results.
From: hareesh makam
Sent: Thursday, July 6, 2017 12:48:18 PM
To: jeff saremi
Cc: user@spark.apache.org
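A minimal sketch of what does work in 1.6 (sqlContext, table1DF and table2DF are assumed names):
```
// a minimal sketch with assumed names (sqlContext, table1DF, table2DF): Spark 1.6
// does support EXCEPT (set semantics, duplicates removed), just not EXCEPT ALL;
// later Spark releases added EXCEPT ALL
val diffSql = sqlContext.sql("SELECT * FROM Table1 EXCEPT SELECT * FROM Table2")
val diffApi = table1DF.except(table2DF)   // same distinct-style semantics via the DataFrame API
```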
On the Spark status UI you can click Stages in the menu and see active (and
completed) stages. For an active stage, you can see Succeeded/Total and a
count of failed tasks in parentheses.
I'm looking for a way to go straight to the failed tasks and list their errors.
Currently I must go into the details
Thank you. That helps
From: 周康
Sent: Monday, July 24, 2017 8:04:51 PM
To: jeff saremi
Cc: user@spark.apache.org
Subject: Re: How to list only errors for a stage
Maybe you can click the header of the Status column in the Tasks section; then
failed tasks will appear first.
I have the simplest job, which I'm running against 100 TB of data. The job keeps
failing with ExecutorLostFailures on containers killed by Yarn for exceeding
memory limits.
I have varied executor-memory from 32 GB to 96 GB, and
spark.yarn.executor.memoryOverhead from 8192 to 36000, and similar c
From: yohann jardin
Sent: Thursday, July 27, 2017 11:15:39 PM
To: jeff saremi; user@spark.apache.org
Subject: Re: How to configure spark on Yarn cluster
Check the executor page of the Spark UI to see whether your storage level is the
limiting factor.
Also, instead of starting with 100 TB of data
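For reference, a minimal sketch of where the settings varied in this thread live as configuration keys; the values are placeholders, not tuning advice:
```
import org.apache.spark.{SparkConf, SparkContext}

// a minimal sketch with placeholder values (not tuning advice): the settings being
// varied in this thread, expressed as configuration keys
val conf = new SparkConf()
  .setAppName("yarn-memory-sketch")
  .set("spark.executor.memory", "32g")
  .set("spark.yarn.executor.memoryOverhead", "8192")   // in MB; the off-heap headroom Yarn enforces
val sc = new SparkContext(conf)
```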
We have a not-too-complex and not-too-large Spark job that keeps dying with
this error.
I have researched it and have not seen any convincing explanation of why it happens.
I am not using a shuffle service. Which server is the one that is refusing the
connection?
If I go to the server that is being reported
… the link you included. Thank you. Yes, this is the same problem; however, it
looks like no one has come up with a solution for it yet.
From: yohann jardin
Sent: Friday, July 28, 2017 10:47:40 AM
To: jeff saremi; user@spark.apache.org
Subject: Re: How to configure spark on Yarn cluster
spark.network.timeout=1000s ^
From: Juan Rodríguez Hortalá
Sent: Friday, July 28, 2017 4:20:40 PM
To: jeff saremi
Cc: user@spark.apache.org
Subject: Re: Job keeps aborting because of
org.apache.spark.shuffle.FetchFailedException: Failed to connect to
server/ip:39232
Hi Je
Asking this on a tangent:
Is there any way for the shuffle data to be replicated to more than one server?
thanks
From: jeff saremi
Sent: Friday, July 28, 2017 4:38:08 PM
To: Juan Rodríguez Hortalá
Cc: user@spark.apache.org
Subject: Re: Job keeps aborting because of
org.apache.spark.shuffle.FetchFailedException: Failed to connect to
server/ip:39232
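For reference, a minimal sketch of the spark.network.timeout workaround quoted earlier in the thread; 1000s is the value from the thread, not a recommendation:
```
import org.apache.spark.{SparkConf, SparkContext}

// a minimal sketch of the timeout setting quoted in this thread; 1000s is the
// value from the thread, not a recommendation
val conf = new SparkConf()
  .setAppName("fetchfailed-sketch")
  .set("spark.network.timeout", "1000s")
val sc = new SparkContext(conf)
```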
Calling cache/persist fails all our jobs (I have posted 2 threads on this), and
we're giving up hope of finding a solution.
So I'd like to find a workaround for that:
If I save an RDD to HDFS and read it back, can I use it in more than one
operation?
Example (using cache):
// do a whole bunch
… the same effect as in my sample code, without the use of cache().
If I use myrdd.count(), would that be a good alternative?
thanks
From: lucas.g...@gmail.com
Sent: Tuesday, August 1, 2017 11:23:04 AM
To: jeff saremi
Cc: user@spark.apache.org
Subject: Re: How can i remove the need for calling cache
Thanks Vadim. I'll try that
From: Vadim Semenov
Sent: Tuesday, August 1, 2017 12:05:17 PM
To: jeff saremi
Cc: user@spark.apache.org
Subject: Re: How can i remove the need for calling cache
You can use `.checkpoint()`:
```
val sc: SparkContext
sc.setCheckpointDir("hdfs:///tmp/checkpointDirectory")
minimized even without an explicit cache call.
On Tue, Aug 1, 2017 at 11:05 AM, jeff saremi
mailto:jeffsar...@hotmail.com>> wrote:
Calling cache/persist fails all our jobs (i have posted 2 threads on this).
And we're giving up hope in finding a solution.
So I'd like to find a wor
hoping for
From: Vadim Semenov
Sent: Tuesday, August 1, 2017 12:05:17 PM
To: jeff saremi
Cc: user@spark.apache.org
Subject: Re: How can i remove the need for calling cache
You can use `.checkpoint()`:
```
val sc: SparkContext
sc.setCheckpointDir("hdfs:///tmp/checkpointDirectory&quo
thanks Vadim. yes this is a good option for us. thanks
From: Vadim Semenov
Sent: Wednesday, August 2, 2017 6:24:40 PM
To: Suzen, Mehmet
Cc: jeff saremi; user@spark.apache.org
Subject: Re: How can i remove the need for calling cache
So if you just save an RDD to
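Pulling the suggestion together, a minimal sketch of the checkpoint()-based workaround; the paths and the example RDD are placeholders:
```
import org.apache.spark.{SparkConf, SparkContext}

// a minimal sketch (paths and the example RDD are placeholders) of the checkpoint()
// approach suggested in this thread
val sc = new SparkContext(new SparkConf().setAppName("checkpoint-sketch"))
sc.setCheckpointDir("hdfs:///tmp/checkpointDirectory")

val rdd = sc.textFile("hdfs:///tmp/input").map(_.length)
rdd.checkpoint()               // mark the RDD to be materialized in the checkpoint dir
rdd.count()                    // the first action triggers writing the checkpoint
rdd.filter(_ > 10).count()     // later actions read the checkpointed data, not the lineage
```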
I'm using a statement like the following to load my dataframe from a text file.
Upon encountering the first error, the whole thing throws an exception and
processing stops.
I'd like to continue loading even if that results in zero rows in my dataframe.
How can I do that?
thanks
spark.read.
____
From: jeff saremi
Sent: Tuesday, September 12, 2017 2:32:03 PM
To: user@spark.apache.org
Subject: Continue reading dataframe from file despite errors
I'm using a statement like the following to load my dataframe from some text
file
Upon encountering the first error,
thanks Suresh. it worked nicely
From: Suresh Thalamati
Sent: Tuesday, September 12, 2017 2:59:29 PM
To: jeff saremi
Cc: user@spark.apache.org
Subject: Re: Continue reading dataframe from file despite errors
Try the CSV option("mode", "dropmalformed"),
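A minimal sketch of that suggestion; the input path is a placeholder:
```
import org.apache.spark.sql.SparkSession

// a minimal sketch of the suggestion above; the path is a placeholder
val spark = SparkSession.builder().appName("dropmalformed-sketch").getOrCreate()
val df = spark.read
  .option("mode", "DROPMALFORMED")   // silently drop lines that fail to parse
  .csv("hdfs:///path/to/input")
```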
I have this line which works in the Spark interactive console but fails in
IntelliJ.
Using Spark 2.1.1 in both cases:
Exception in thread "main" java.lang.RuntimeException: Multiple sources found
for csv (org.apache.spark.sql.execution.datasources.csv.CSVFileFormat,
com.databricks.spark.csv.
"com.databricks...
____
From: jeff saremi
Sent: Tuesday, September 12, 2017 3:38:00 PM
To: user@spark.apache.org
Subject: Multiple Sources found for csv
I have this line which works in the Spark interactive console but fails in
IntelliJ.
Using Spark 2.1.1 in both cases:
Exception in thread "main" java.lang.RuntimeException: Multiple sources found
for csv
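A minimal sketch of one common workaround (the path is a placeholder, and spark is an assumed SparkSession): refer to the built-in source by its fully qualified class so the short name "csv" is no longer ambiguous, or simply remove the com.databricks:spark-csv dependency from the IntelliJ project:
```
// a minimal sketch of one workaround (path is a placeholder): name the built-in source
// by its fully qualified class so the short name "csv" is no longer ambiguous when
// com.databricks:spark-csv is also on the classpath
val df = spark.read
  .format("org.apache.spark.sql.execution.datasources.csv.CSVFileFormat")
  .option("header", "true")
  .load("hdfs:///path/to/input.csv")
```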