his? How can it be fixed? Right now it renders many of the metrics
useless: I want a complete view of the application but am only seeing a
few executors at a time.
Thanks,
rok
There is a stddev function since 1.6:
http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.functions.stddev
If you are using Spark < 1.6 you can write your own more or less easily.
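For Spark < 1.6, a minimal sketch using aggregates that already exist
there (assuming a DataFrame df with a numeric column "x"; this is the
plain sum-of-squares identity, so beware of cancellation on badly-scaled
data):

    from math import sqrt
    from pyspark.sql import functions as F

    n, s, s2 = df.agg(F.count("x"), F.sum("x"),
                      F.sum(df["x"] * df["x"])).first()
    stddev = sqrt((s2 - s * s / n) / (n - 1))  # sample stddev, n - 1 denominator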
On Wed, Feb 17, 2016 at 5:06 PM, mayx [via Apache Spark User List] <
ml-node+s1001560n26250.
Apologies if this appears a second time!
I'm writing a ~100 Gb pyspark DataFrame with a few hundred partitions into a
parquet file on HDFS. I've got a few hundred nodes in the cluster, so for
the size of file this is way over-provisioned (I've tried it with fewer
partitions and fewer nodes, no ob
I am trying to run Spark applications with the driver running locally and
interacting with a firewalled remote cluster via a SOCKS proxy.
I have to modify the hadoop configuration on the *local machine* to try to
make this work, adding
hadoop.rpc.socket.factory.class.default
org.apache.h
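For reference, the pair of properties usually involved here (set in
core-site.xml on the local machine; the endpoint below is a placeholder
for wherever your SOCKS tunnel listens):

    hadoop.rpc.socket.factory.class.default = org.apache.hadoop.net.SocksSocketFactory
    hadoop.socks.server = localhost:1080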
zipWithIndex gives you global indices, which is not what you want. You'll
want to use flatMap with a map function that iterates through each iterable
and returns the (String, Int, String) tuple for each element.
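Something along these lines, as a sketch (assuming the input is an RDD of
(key, iterable-of-strings) pairs and the index should restart at zero
within each group):

    def index_group(kv):
        key, values = kv
        return [(key, i, v) for i, v in enumerate(values)]

    indexed = rdd.flatMap(index_group)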
On Thu, Jul 30, 2015 at 4:13 AM, askformore [via Apache Spark User List] <
ml-node+s10
how much of the execution time is used up by
cPickle.load() and .dump() methods.
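One way to measure that directly is the built-in Python profiler -- a
sketch, assuming Spark 1.2+ where the hook exists:

    from pyspark import SparkConf, SparkContext

    conf = SparkConf().set("spark.python.profile", "true")
    sc = SparkContext(conf=conf)

    sc.parallelize(range(10**6)).map(lambda x: x * 2).count()
    sc.show_profiles()  # cumulative Python-side profile, pickling included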
Hope that helps,
Rok
On Wed, Mar 8, 2017 at 3:18 AM, Yeoul Na [via Apache Spark User List] <
ml-node+s1001560n28468...@n3.nabble.com> wrote:
>
> Hi all,
>
> I am trying to analyze PySpark pe
When running the connectedComponents algorithm in GraphFrames on a
sufficiently large dataset, I get the following error I have not encountered
before:
17/04/20 20:35:26 WARN TaskSetManager: Lost task 3.0 in stage 101.0 (TID
53644, 172.19.1.206, executor 40): java.lang.IllegalStateException: Ther
:1311)
org.apache.spark.api.python.PythonRDD$WriterThread.run(PythonRDD.scala:183)
Not really sure where to start looking for the culprit -- any suggestions
most welcome. Thanks!
Rok
o try next.
Is there a limit to the size of broadcast variables? This one is rather
large (a few Gb dict). Thanks!
Rok
n roughly doubles at
each iteration so this clearly doesn't seem to be the way to do it. What is
the preferred way of doing such repeated modifications to an RDD and how can
the accumulation of overhead be minimized?
Thanks!
Rok
he applications somehow binding to the wrong
ports? Is this a spark setting I need to configure or something within YARN?
Thanks!
Rok
I'm seeing this java.util.NoSuchElementException: key not found: exception
pop up sometimes when I run operations on an RDD from multiple threads in a
python application. It ends up shutting down the SparkContext so I'm
assuming this is a bug -- from what I understand, I should be able to run
opera
I'm using log aggregation on YARN with Spark and I am not able to see the
logs through the YARN web UI after the application completes:
Failed redirect for container_1425390894284_0066_01_01
Failed while trying to construct the redirect url to the log server. Log
Server url may not b
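That message usually means the node managers don't know where the history
server lives; the property involved, for reference (in yarn-site.xml,
with host and port as placeholders for your JobHistory server):

    yarn.log.server.url = http://historyserver-host:19888/jobhistory/logs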
rks without problems. Should I be setting up executors
differently?
Thanks,
Rok
I am trying to sort a collection of key,value pairs (between several hundred
million to a few billion) and have recently been getting lots of
"FetchFailedException" errors that seem to originate when one of the
executors doesn't seem to find a temporary shuffle file on disk. E.g.:
org.apache.spar
I'm processing some data using PySpark and I'd like to save the RDDs to disk
(they are (k,v) RDDs of strings and SparseVector types) and read them in
using Scala to run them through some other analysis. Is this possible?
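One route that keeps the data language-neutral is to go through a
DataFrame and Parquet, which Scala can read back directly -- a sketch,
assuming Spark 1.4+ (the path is a placeholder):

    from pyspark.sql import SQLContext

    sqlContext = SQLContext(sc)
    df = sqlContext.createDataFrame(rdd, ["key", "features"])  # (str, SparseVector) pairs
    df.write.parquet("hdfs:///path/to/vectors.parquet")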
Thanks,
Rok
return getattr(self._sock,name)(*args)
error: [Errno 32] Broken pipe
I'm at a loss as to where to begin to debug this... any suggestions? Thanks,
Rok
yes, the training set is fine, I've verified it.
I'm trying to understand the disk I/O patterns for Spark -- specifically, I'd
like to reduce the number of files that are being written during shuffle
operations. A couple questions:
* is the amount of file I/O performed independent of the memory I allocate
for the shuffles?
* if this is the ca
Hi, I'm using Spark 1.1.0. There is no error on the executors -- it appears
as if the job never gets properly dispatched -- the only message is the
"Broken Pipe" message in the driver.
I have an RDD that serves as a feature look-up table downstream in my
analysis. I create it using zipWithIndex() and, because I suppose that
the elements of the RDD could end up in a different order if it is
regenerated at any point, I cache it to try and ensure that the (feature -->
index) mapp
true though I was hoping to avoid having to sort... maybe there's no way
around it. Thanks!
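If the sort does turn out to be unavoidable, the deterministic version is
short -- a sketch, with features_rdd and key_fn standing in for the RDD
and whatever makes its elements orderable:

    lookup = (features_rdd
              .sortBy(key_fn)
              .zipWithIndex()
              .cache())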
I have an RDD of SparseVectors and I'd like to calculate the means,
returning a dense vector. I've tried doing this with the following (using
PySpark, Spark v1.2.0):
def aggregate_partition_values(vec1, vec2):
    vec1[vec2.indices] += vec2.values
    return vec1

def aggregate_combined_vectors(vec1, vec2):
    # the combiner is truncated in the archive; the obvious completion is
    # an element-wise sum of the two dense accumulators
    return vec1 + vec2
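The aggregate call that goes with these would be something like (a
sketch; dim and vectors stand in for the vector dimension and the RDD):

    import numpy as np

    sums = vectors.aggregate(np.zeros(dim),
                             aggregate_partition_values,
                             aggregate_combined_vectors)
    means = sums / float(vectors.count())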
t the cause is. Would be
> helpful for you to send the logs.
>
> Imran
>
> On Fri, May 15, 2015 at 10:07 AM, rok wrote:
>
>> I am trying to sort a collection of key,value pairs (between several
>> hundred
>> million to a few billion) and have recently been getti
ally finish, even though the tasks are throwing these errors while
> writing the map output? Or do you sometimes get failures on the shuffle
> write side, and sometimes on the shuffle read side? (Not that I think you
> are doing anything wrong, but it may help narrow down the root ca
I'm writing a ~100 Gb pyspark DataFrame with a few hundred partitions into
a parquet file on HDFS. I've got a few hundred nodes in the cluster, so for
the size of file this is way over-provisioned (I've tried it with fewer
partitions and fewer nodes, no obvious effect). I was expecting the dump to
schema consisting of a StructType with a few StructField floats and
a string. I'm using all the Spark defaults for IO compression.
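If the partition count itself is a suspect, one cheap experiment is to
collapse the partitions just before the write -- a sketch (the count and
path are placeholders):

    df.coalesce(64).write.parquet("hdfs:///path/out")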
I'll see what I can do about running a profiler -- can you point me to a
resource/example?
Thanks,
Rok
ps: my post on the mailing list is still listed as not accept
I'm not sure what you mean -- I didn't do anything specifically to
partition the columns.
On Nov 14, 2015 00:38, "Davies Liu" wrote:
> Do you have partitioned columns?
>
> On Thu, Nov 5, 2015 at 2:08 AM, Rok Roskar wrote:
> > I'm writing a ~100 Gb pyspark
/24 08:10:59 INFO yarn.ApplicationMaster: ApplicationAttemptId:
appattempt_1437724871597_0001_01
15/07/24 08:11:00 INFO spark.SecurityManager: Changing view acls to: root,rok
15/07/24 08:11:00 INFO spark.SecurityManager: Changing modify acls to: root,rok
15/07/24 08:11:00 INFO
1916+ partitions (I don't know why actually -- how does this number
come about?). How can I check if any objects/batches are exceeding 2Gb?
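One way to check from the Python side is to total the pickled size of
each partition -- a sketch (cPickle, so Python 2):

    import cPickle

    def pickled_mb(iterator):
        yield sum(len(cPickle.dumps(row, 2)) for row in iterator) / 1024.0 ** 2

    sizes = rdd.mapPartitions(pickled_mb).collect()
    print max(sizes), "MB in the largest partition"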
Thanks,
Rok
On Tue, Jan 27, 2015 at 7:55 PM, Davies Liu wrote:
> Maybe it's caused by integer overflow, is it possible that one object
>
f this was somehow automatically corrected!
Thanks,
Rok
On Wed, Jan 28, 2015 at 7:01 PM, Davies Liu wrote:
> HadoopRDD will try to split the file as 64M partitions in size, so you
> got 1916+ partitions.
> (assume 100k per row, they are 80G in size).
>
> I think it has very small
> JIRA for it, thanks!
>
> On Tue, Feb 10, 2015 at 10:42 AM, rok wrote:
>> I'm trying to use a broadcasted dictionary inside a map function and am
>> consistently getting Java null pointer exceptions. This is inside an IPython
>> session connected to a standalone s
variables
or perhaps something different?
The driver and executors have 50 GB of RAM so memory should be fine.
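For what it's worth, the pickled size of the payload is easy to check
before broadcasting -- a sketch, with bcast_dict standing in for the
lookup table:

    import cPickle
    print len(cPickle.dumps(bcast_dict, 2)) / 1024.0 ** 3, "GB pickled"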
Thanks,
Rok
On Feb 11, 2015 12:19 AM, "Davies Liu" wrote:
> It's brave to broadcast 8G pickled data, it will take more than 15G in
> memory for each Python worker
ive
> me a 404:
> http://spark.apache.org/docs/1.1.0/api/python/pyspark.rdd.RDD-class.html#mapPartitions
>
> On Wed Feb 11 2015 at 1:48:42 PM rok wrote:
>
>> I was having trouble with memory exceptions when broadcasting a large
>> lookup
>> table, so I've resorted to pr
I think the problem was related to the broadcasts being too large -- I've
now split it up into many smaller operations but it's still not quite there
-- see
http://apache-spark-user-list.1001560.n3.nabble.com/iteratively-modifying-an-RDD-td21606.html
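The splitting itself is mechanical -- a sketch of the kind of chunking
meant here, with big_table standing in for the original dict:

    def chunk_dict(d, n_chunks):
        items = d.items()
        step = len(items) // n_chunks + 1
        return [dict(items[i:i + step]) for i in range(0, len(items), step)]

    broadcasts = [sc.broadcast(c) for c in chunk_dict(big_table, 8)]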
Thanks,
Rok
On Wed, Feb 11, 2
Aha great! Thanks for the clarification!
On Feb 11, 2015 8:11 PM, "Davies Liu" wrote:
> On Wed, Feb 11, 2015 at 10:47 AM, rok wrote:
> > I was having trouble with memory exceptions when broadcasting a large
> lookup
> > table, so I've resorted to processin
the runtime for each consecutive iteration is still roughly twice as long as
for the previous one -- is there a way to reduce whatever overhead is
accumulating?
On Feb 11, 2015, at 8:11 PM, Davies Liu wrote:
> On Wed, Feb 11, 2015 at 10:47 AM, rok wrote:
>> I was having trouble wi
utive map is slower than the one
before. I'll try the checkpoint, thanks for the suggestion.
On Feb 12, 2015, at 12:13 AM, Davies Liu wrote:
> On Wed, Feb 11, 2015 at 2:43 PM, Rok Roskar wrote:
>> the runtime for each consecutive iteration is still roughly twice as long as
>
6163,
0.38274814840717364,
0.6606453820150496,
0.8610156719813942,
0.6971353266345091,
0.9896836700210551,
0.05789392881996358]
Is there a size limit for objects serialized with Kryo? Or an option that
controls it? The Java serializer works fine.
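The size-related knob on the Kryo side, for reference, is the buffer cap
-- a sketch (in Spark 1.x the property is
spark.kryoserializer.buffer.max.mb; later releases renamed it to
spark.kryoserializer.buffer.max):

    from pyspark import SparkConf
    conf = SparkConf().set("spark.kryoserializer.buffer.max.mb", "512")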
On Wed, Feb 11, 2015 at 8:04 PM, Rok Roskar w
, Shixiong Zhu wrote:
> RDD is not thread-safe. You should not use it in multiple threads.
>
> Best Regards,
> Shixiong Zhu
>
> 2015-02-27 23:14 GMT+08:00 rok :
>
>> I'm seeing this java.util.NoSuchElementException: key not found: exception
>> pop up sometime
ve tried setting the spark.yarn.am.memory
and overhead parameters but it doesn't seem to have an effect.
Thanks,
Rok
On Apr 23, 2015, at 7:47 AM, Xiangrui Meng wrote:
> What is the feature dimension? Did you set the driver memory? -Xiangrui
>
> On Tue, Apr 21, 2015 at 6:59 AM, ro
B overhead
Is there some reason why these options are being ignored and instead
starting the driver with just 512Mb of heap?
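One thing worth ruling out: in client mode the driver JVM is already
running by the time SparkConf values are read, so spark.driver.memory set
from application code is silently ignored; it has to be supplied at
launch, e.g.

    spark-submit --driver-memory 8g my_app.py

or via spark.driver.memory in spark-defaults.conf.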
On Thu, Apr 23, 2015 at 8:06 AM, Rok Roskar wrote:
> the feature dimension is 800k.
>
> yes, I believe the driver memory is likely the problem since it doesn
That's exactly what I'm saying -- I specify the memory options using spark
options, but this is not reflected in how the JVM is created. No matter
which memory settings I specify, the JVM for the driver is always made with
512Mb of memory. So I'm not sure if this is a feature or
ious about any other best-practices tips anyone might have for
running pyspark with numpy data...!
Thanks!
Rok
l by
> given descriptions.
>
> For examples, you can partition your data into (path, start, length), then
>
> partitions = [(path1, start1, length), (path1, start2, length), ...]
>
> def read_chunk(path, start, length):
>     f = open(path)
>     f.seek(start)
>
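The quoted sketch is cut off by the archive; the natural completion reads
the chunk and hands it back, something like (assuming raw bytes are fine
and parsing happens downstream):

    def read_chunk(path, start, length):
        f = open(path)
        f.seek(start)
        data = f.read(length)
        f.close()
        yield data

    chunks = sc.parallelize(partitions, len(partitions)).flatMap(
        lambda args: read_chunk(*args))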
ng if
Spark is setting the per-core memory limit somewhere?
Thanks,
Rok
6: 3594, 7: 134}
Why this strange redistribution of elements? I'm obviously misunderstanding how
Spark does the partitioning -- is it a problem with having a list of strings as
an RDD?
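For reference, a per-partition census like the one above comes from
something along these lines (a sketch):

    counts = rdd.mapPartitionsWithIndex(
        lambda i, it: [(i, sum(1 for _ in it))]).collectAsMap()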
Help very much appreciated!
Thanks,
Rok
. -Xiangrui
>
> On Wed, Jan 7, 2015 at 9:42 AM, rok wrote:
> > I have an RDD of SparseVectors and I'd like to calculate the means
> returning
> > a dense vector. I've tried doing this with the following (using pyspark,
> > spark v1.2.0):
> >
> > def aggr
i, Jan 9, 2015 at 3:46 AM, Rok Roskar wrote:
> > thanks for the suggestion -- however, looks like this is even slower.
> With
> > the small data set I'm using, my aggregate function takes ~ 9 seconds and
> > the colStats.mean() takes ~ 1 minute. However, I can't get i