spark metrics in graphite missing for some executors

2015-12-11 Thread rok
his? How can it be fixed? Right now it's rendering many metrics useless since I want to have a complete view into the application and I'm only seeing a few executors at a time. Thanks, rok

Re: Is stddev not a supported aggregation function in SparkSQL WindowSpec?

2016-02-18 Thread rok
There is a stddev function since 1.6: http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.functions.stddev If you are using spark < 1.6 you can write your own more or less easily.
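
A minimal PySpark sketch of this, assuming a hypothetical DataFrame with "group" and "value" columns; stddev as a grouped aggregate is available from 1.6, and using it over a WindowSpec may additionally require a newer release:

from pyspark.sql import SparkSession, functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("a", 1.0), ("a", 3.0), ("b", 2.0)], ["group", "value"])

# stddev as a regular grouped aggregate
df.groupBy("group").agg(F.stddev("value").alias("value_stddev")).show()

# stddev over a window partitioned by "group"
w = Window.partitionBy("group")
df.withColumn("value_stddev", F.stddev("value").over(w)).show()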

very slow parquet file write

2015-11-05 Thread rok
Apologies if this appears a second time! I'm writing a ~100 Gb pyspark DataFrame with a few hundred partitions into a parquet file on HDFS. I've got a few hundred nodes in the cluster, so for the size of file this is way over-provisioned (I've tried it with fewer partitions and fewer nodes, no obvious effect).
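
For context, a hedged sketch of the kind of write being described, with a made-up partition count and output path; controlling the number of partitions before the write is one of the usual knobs to try:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.range(0, 1000000)  # tiny stand-in for the ~100 GB DataFrame in the post

# Repartition to an explicit number of output files before writing;
# both too many small files and too few huge ones can slow the write down.
(df.repartition(200)
   .write
   .mode("overwrite")
   .parquet("/tmp/example_output"))  # hypothetical path (the post writes to HDFS)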

problems running Spark on a firewalled remote YARN cluster via SOCKS proxy

2015-07-22 Thread rok
I am trying to run Spark applications with the driver running locally and interacting with a firewalled remote cluster via a SOCKS proxy. I have to modify the hadoop configuration on the *local machine* to try to make this work, adding hadoop.rpc.socket.factory.class.default org.apache.h
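
A hedged sketch of one way to pass those Hadoop properties through Spark itself instead of editing core-site.xml, using the spark.hadoop.* prefix; the proxy address is a placeholder and the exact properties needed depend on the cluster:

from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setAppName("socks-proxy-sketch")
        # forwarded into the Hadoop Configuration via the spark.hadoop.* prefix
        .set("spark.hadoop.hadoop.rpc.socket.factory.class.default",
             "org.apache.hadoop.net.SocksSocketFactory")
        .set("spark.hadoop.hadoop.socks.server", "localhost:1080"))  # placeholder proxy host:port
sc = SparkContext(conf=conf)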

Re: help plz! how to use zipWithIndex to each subset of a RDD

2015-07-30 Thread rok
zipWithIndex gives you global indices, which is not what you want. You'll want to use flatMap with a map function that iterates through each iterable and returns the (String, Int, String) tuple for each element.
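
A small sketch of that suggestion with made-up (key, value) data: group the pairs, then flatMap with enumerate to assign an index within each group:

from pyspark import SparkContext

sc = SparkContext.getOrCreate()

# hypothetical (key, value) pairs standing in for the poster's data
pairs = sc.parallelize([("a", "x"), ("a", "y"), ("b", "z")])

def index_group(kv):
    key, values = kv
    # enumerate within each group to get a per-group index
    return [(key, i, v) for i, v in enumerate(values)]

print(pairs.groupByKey().flatMap(index_group).collect())
# e.g. [('a', 0, 'x'), ('a', 1, 'y'), ('b', 0, 'z')]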

Re: PySpark Serialization/Deserialization (Pickling) Overhead

2017-03-08 Thread rok
how much of the execution time is used up by cPickle.load() and .dump() methods. Hope that helps, Rok On Wed, Mar 8, 2017 at 3:18 AM, Yeoul Na [via Apache Spark User List] < ml-node+s1001560n28468...@n3.nabble.com> wrote: > > Hi all, > > I am trying to analyze PySpark pe
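
One way to get at those numbers is PySpark's built-in worker profiler; a minimal sketch (the job itself is a made-up placeholder) whose per-stage cProfile output may show how much time goes into pickle load/dump:

from pyspark import SparkConf, SparkContext

conf = SparkConf().set("spark.python.profile", "true")
sc = SparkContext(conf=conf)

# run something that forces data through the Python workers
sc.parallelize(range(100000)).map(lambda x: (x % 10, x)).groupByKey().mapValues(len).collect()

sc.show_profiles()  # dumps the cumulative cProfile stats collected on the workers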

"java.lang.IllegalStateException: There is no space for new record" in GraphFrames

2017-04-28 Thread rok
When running the connectedComponents algorithm in GraphFrames on a sufficiently large dataset, I get the following error I have not encountered before: 17/04/20 20:35:26 WARN TaskSetManager: Lost task 3.0 in stage 101.0 (TID 53644, 172.19.1.206, executor 40): java.lang.IllegalStateException: There is no space for new record
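
For reference, a minimal sketch of the call that triggers this, using a toy graph and a hypothetical checkpoint directory (connectedComponents requires one); the failure itself only appears at much larger scale:

from pyspark.sql import SparkSession
from graphframes import GraphFrame  # requires the graphframes package

spark = SparkSession.builder.getOrCreate()
spark.sparkContext.setCheckpointDir("/tmp/graphframes-checkpoints")  # hypothetical path

vertices = spark.createDataFrame([(1,), (2,), (3,), (4,)], ["id"])
edges = spark.createDataFrame([(1, 2), (3, 4)], ["src", "dst"])

g = GraphFrame(vertices, edges)
g.connectedComponents().show()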

NegativeArraySizeException in pyspark when loading an RDD pickleFile

2015-01-27 Thread rok
:1311) org.apache.spark.api.python.PythonRDD$WriterThread.run(PythonRDD.scala:183) Not really sure where to start looking for the culprit -- any suggestions most welcome. Thanks! Rok

pyspark: Java null pointer exception when accessing broadcast variables

2015-02-10 Thread rok
o try next. Is there a limit to the size of broadcast variables? This one is rather large (a few Gb dict). Thanks! Rok
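
For context, a toy version of the pattern being described, with a small dict standing in for the multi-GB broadcast:

from pyspark import SparkContext

sc = SparkContext.getOrCreate()

lookup = {"a": 1, "b": 2}          # stand-in for the large broadcast dict
bc = sc.broadcast(lookup)

rdd = sc.parallelize(["a", "b", "a", "c"])
print(rdd.map(lambda k: bc.value.get(k)).collect())  # [1, 2, 1, None]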

iteratively modifying an RDD

2015-02-11 Thread rok
n roughly doubles at each iteration so this clearly doesn't seem to be the way to do it. What is the preferred way of doing such repeated modifications to an RDD and how can the accumulation of overhead be minimized? Thanks! Rok

cannot connect to Spark Application Master in YARN

2015-02-18 Thread rok
he applications somehow binding to the wrong ports? Is this a spark setting I need to configure or something within YARN? Thanks! Rok

java.util.NoSuchElementException: key not found:

2015-02-27 Thread rok
I'm seeing this java.util.NoSuchElementException: key not found: exception pop up sometimes when I run operations on an RDD from multiple threads in a python application. It ends up shutting down the SparkContext so I'm assuming this is a bug -- from what I understand, I should be able to run opera

failure to display logs on YARN UI with log aggregation on

2015-03-09 Thread rok
I'm using log aggregation on YARN with Spark and I am not able to see the logs through the YARN web UI after the application completes: Failed redirect for container_1425390894284_0066_01_01 Failed while trying to construct the redirect url to the log server. Log Server url may not be configured

StandardScaler failing with OOM errors in PySpark

2015-04-21 Thread rok
rks without problems. Should I be setting up executors differently? Thanks, Rok

FetchFailedException and MetadataFetchFailedException

2015-05-15 Thread rok
I am trying to sort a collection of key,value pairs (between several hundred million to a few billion) and have recently been getting lots of "FetchFailedException" errors that seem to originate when one of the executors doesn't seem to find a temporary shuffle file on disk. E.g.: org.apache.spar

sharing RDDs between PySpark and Scala

2014-10-30 Thread rok
I'm processing some data using PySpark and I'd like to save the RDDs to disk (they are (k,v) RDDs of strings and SparseVector types) and read them in using Scala to run them through some other analysis. Is this possible? Thanks, Rok

using LogisticRegressionWithSGD.train in Python crashes with "Broken pipe"

2014-11-05 Thread rok
return getattr(self._sock,name)(*args) error: [Errno 32] Broken pipe I'm at a loss as to where to begin to debug this... any suggestions? Thanks, Rok
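
For reference, a minimal sketch of the call being discussed, using the RDD-based spark.mllib API of that era and a tiny made-up training set:

from pyspark import SparkContext
from pyspark.mllib.classification import LogisticRegressionWithSGD
from pyspark.mllib.regression import LabeledPoint

sc = SparkContext.getOrCreate()

data = sc.parallelize([
    LabeledPoint(0.0, [0.0, 1.0]),
    LabeledPoint(1.0, [1.0, 0.0]),
])
model = LogisticRegressionWithSGD.train(data, iterations=10)
print(model.predict([1.0, 0.0]))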

Re: using LogisticRegressionWithSGD.train in Python crashes with "Broken pipe"

2014-11-05 Thread rok
yes, the training set is fine, I've verified it.

minimizing disk I/O

2014-11-13 Thread rok
I'm trying to understand the disk I/O patterns for Spark -- specifically, I'd like to reduce the number of files that are being written during shuffle operations. A couple questions: * is the amount of file I/O performed independent of the memory I allocate for the shuffles? * if this is the ca

Re: using LogisticRegressionWithSGD.train in Python crashes with "Broken pipe"

2014-11-13 Thread rok
Hi, I'm using Spark 1.1.0. There is no error on the executors -- it appears as if the job never gets properly dispatched -- the only message is the "Broken Pipe" message in the driver.

ensuring RDD indices remain immutable

2014-12-01 Thread rok
I have an RDD that serves as a feature look-up table downstream in my analysis. I create it using the zipWithIndex() and because I suppose that the elements of the RDD could end up in a different order if it is regenerated at any point, I cache it to try and ensure that the (feature --> index) mapp

Re: ensuring RDD indices remain immutable

2014-12-01 Thread rok
true though I was hoping to avoid having to sort... maybe there's no way around it. Thanks!
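
A small sketch of the sort-then-index approach discussed here, with made-up feature names; sorting first makes the assignment deterministic even if the RDD has to be recomputed:

from pyspark import SparkContext

sc = SparkContext.getOrCreate()

features = sc.parallelize(["f3", "f1", "f2"])

feature_index = features.sortBy(lambda f: f).zipWithIndex().cache()
print(feature_index.collect())  # [('f1', 0), ('f2', 1), ('f3', 2)]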

calculating the mean of SparseVector RDD

2015-01-07 Thread rok
I have an RDD of SparseVectors and I'd like to calculate the means returning a dense vector. I've tried doing this with the following (using pyspark, spark v1.2.0):

def aggregate_partition_values(vec1, vec2):
    vec1[vec2.indices] += vec2.values
    return vec1

def aggregate_combined_vectors(v
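
The follow-up replies archived below compare this against MLlib's colStats.mean(); a small sketch of that alternative with toy SparseVectors:

from pyspark import SparkContext
from pyspark.mllib.linalg import Vectors
from pyspark.mllib.stat import Statistics

sc = SparkContext.getOrCreate()

rdd = sc.parallelize([
    Vectors.sparse(3, {0: 1.0}),
    Vectors.sparse(3, {1: 2.0, 2: 4.0}),
])
print(Statistics.colStats(rdd).mean())  # dense array of per-column means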

Re: FetchFailedException and MetadataFetchFailedException

2015-05-22 Thread Rok Roskar
t the cause is. Would be > helpful for you to send the logs. > > Imran > > On Fri, May 15, 2015 at 10:07 AM, rok wrote: > >> I am trying to sort a collection of key,value pairs (between several >> hundred >> million to a few billion) and have recently been getti

Re: FetchFailedException and MetadataFetchFailedException

2015-05-28 Thread Rok Roskar
ally finish, even though the tasks are throwing these errors while > writing the map output? Or do you sometimes get failures on the shuffle > write side, and sometimes on the shuffle read side? (Not that I think you > are doing anything wrong, but it may help narrow down the root ca

very slow parquet file write

2015-11-05 Thread Rok Roskar
I'm writing a ~100 Gb pyspark DataFrame with a few hundred partitions into a parquet file on HDFS. I've got a few hundred nodes in the cluster, so for the size of file this is way over-provisioned (I've tried it with fewer partitions and fewer nodes, no obvious effect). I was expecting the dump to

Re: very slow parquet file write

2015-11-06 Thread Rok Roskar
schema consisting of a StructType with a few StructField floats and a string. I’m using all the spark defaults for io compression. I'll see what I can do about running a profiler -- can you point me to a resource/example? Thanks, Rok ps: my post on the mailing list is still listed as not accept

Re: very slow parquet file write

2015-11-13 Thread Rok Roskar
I'm not sure what you mean? I didn't do anything specifically to partition the columns On Nov 14, 2015 00:38, "Davies Liu" wrote: > Do you have partitioned columns? > > On Thu, Nov 5, 2015 at 2:08 AM, Rok Roskar wrote: > > I'm writing a ~100 Gb pyspark
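
For reference, a hedged sketch of what a write with partitioned columns would look like, with a hypothetical column and path; high-cardinality partition columns produce many small files and can slow a write down dramatically:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1, "2015-11-05"), (2, "2015-11-06")], ["value", "day"])

# one output directory per distinct value of "day"
df.write.mode("overwrite").partitionBy("day").parquet("/tmp/partitioned_output")  # hypothetical path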

Re: problems running Spark on a firewalled remote YARN cluster via SOCKS proxy

2015-07-24 Thread Rok Roskar
/24 08:10:59 INFO yarn.ApplicationMaster: ApplicationAttemptId: appattempt_1437724871597_0001_01 15/07/24 08:11:00 INFO spark.SecurityManager: Changing view acls to: root,rok 15/07/24 08:11:00 INFO spark.SecurityManager: Changing modify acls to: root,rok 15/07/24 08:11:00 INFO

Re: NegativeArraySizeException in pyspark when loading an RDD pickleFile

2015-01-28 Thread Rok Roskar
1916+ partitions (I don't know why actually -- how does this number come about?). How can I check if any objects/batches are exceeding 2Gb? Thanks, Rok On Tue, Jan 27, 2015 at 7:55 PM, Davies Liu wrote: > Maybe it's caused by integer overflow, is it possible that one object >

Re: NegativeArraySizeException in pyspark when loading an RDD pickleFile

2015-01-29 Thread Rok Roskar
f this was somehow automatically corrected! Thanks, Rok On Wed, Jan 28, 2015 at 7:01 PM, Davies Liu wrote: > HadoopRDD will try to split the file as 64M partitions in size, so you > got 1916+ partitions. > (assume 100k per row, they are 80G in size). > > I think it has very small

Re: pyspark: Java null pointer exception when accessing broadcast variables

2015-02-10 Thread Rok Roskar
> JIRA for it, thanks! > > On Tue, Feb 10, 2015 at 10:42 AM, rok wrote: >> I'm trying to use a broadcasted dictionary inside a map function and am >> consistently getting Java null pointer exceptions. This is inside an IPython >> session connected to a standalone s

Re: pyspark: Java null pointer exception when accessing broadcast variables

2015-02-10 Thread Rok Roskar
variables or perhaps something different? The driver and executors have 50gb of ram so memory should be fine. Thanks, Rok On Feb 11, 2015 12:19 AM, "Davies Liu" wrote: > It's brave to broadcast 8G pickled data, it will take more than 15G in > memory for each Python worker

Re: iteratively modifying an RDD

2015-02-11 Thread Rok Roskar
ive > me a 404: > http://spark.apache.org/docs/1.1.0/api/python/pyspark.rdd.RDD-class.html#mapPartitions > > On Wed Feb 11 2015 at 1:48:42 PM rok wrote: > >> I was having trouble with memory exceptions when broadcasting a large >> lookup >> table, so I've resorted to pr

Re: pyspark: Java null pointer exception when accessing broadcast variables

2015-02-11 Thread Rok Roskar
I think the problem was related to the broadcasts being too large -- I've now split it up into many smaller operations but it's still not quite there -- see http://apache-spark-user-list.1001560.n3.nabble.com/iteratively-modifying-an-RDD-td21606.html Thanks, Rok On Wed, Feb 11, 2

Re: iteratively modifying an RDD

2015-02-11 Thread Rok Roskar
Aha great! Thanks for the clarification! On Feb 11, 2015 8:11 PM, "Davies Liu" wrote: > On Wed, Feb 11, 2015 at 10:47 AM, rok wrote: > > I was having trouble with memory exceptions when broadcasting a large > lookup > > table, so I've resorted to processin

Re: iteratively modifying an RDD

2015-02-11 Thread Rok Roskar
the runtime for each consecutive iteration is still roughly twice as long as for the previous one -- is there a way to reduce whatever overhead is accumulating? On Feb 11, 2015, at 8:11 PM, Davies Liu wrote: > On Wed, Feb 11, 2015 at 10:47 AM, rok wrote: >> I was having trouble wi

Re: iteratively modifying an RDD

2015-02-11 Thread Rok Roskar
utive map is slower than the one before. I'll try the checkpoint, thanks for the suggestion. On Feb 12, 2015, at 12:13 AM, Davies Liu wrote: > On Wed, Feb 11, 2015 at 2:43 PM, Rok Roskar wrote: >> the runtime for each consecutive iteration is still roughly twice as long as >
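
A sketch of the checkpoint suggestion: truncating the lineage every few iterations keeps each pass from replaying all of the earlier maps (checkpoint directory and iteration counts are made up):

from pyspark import SparkContext

sc = SparkContext.getOrCreate()
sc.setCheckpointDir("/tmp/spark-checkpoints")  # hypothetical path

rdd = sc.parallelize(range(1000))
for i in range(20):
    rdd = rdd.map(lambda x: x + 1)
    if (i + 1) % 5 == 0:
        rdd.checkpoint()
        rdd.count()   # force materialization so the lineage is actually cut
print(rdd.take(5))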

Re: pyspark: Java null pointer exception when accessing broadcast variables

2015-02-12 Thread Rok Roskar
6163, 0.38274814840717364, 0.6606453820150496, 0.8610156719813942, 0.6971353266345091, 0.9896836700210551, 0.05789392881996358] Is there a size limit for objects serialized with Kryo? Or an option that controls it? The Java serializer works fine. On Wed, Feb 11, 2015 at 8:04 PM, Rok Roskar w
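
There is a configurable Kryo buffer cap: spark.kryoserializer.buffer.max on recent releases (spark.kryoserializer.buffer.max.mb on the 1.x versions of that era). A hedged configuration sketch:

from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
        .set("spark.kryoserializer.buffer.max", "512m"))  # raise if large objects fail to serialize
sc = SparkContext(conf=conf)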

Re: java.util.NoSuchElementException: key not found:

2015-03-02 Thread Rok Roskar
, Shixiong Zhu wrote: > RDD is not thread-safe. You should not use it in multiple threads. > > Best Regards, > Shixiong Zhu > > 2015-02-27 23:14 GMT+08:00 rok : > >> I'm seeing this java.util.NoSuchElementException: key not found: exception >> pop up sometime

Re: StandardScaler failing with OOM errors in PySpark

2015-04-22 Thread Rok Roskar
ve tried setting the spark.yarn.am.memory and overhead parameters but it doesn't seem to have an effect. Thanks, Rok On Apr 23, 2015, at 7:47 AM, Xiangrui Meng wrote: > What is the feature dimension? Did you set the driver memory? -Xiangrui > > On Tue, Apr 21, 2015 at 6:59 AM, ro

Re: StandardScaler failing with OOM errors in PySpark

2015-04-23 Thread Rok Roskar
B overhead Is there some reason why these options are being ignored and instead starting the driver with just 512Mb of heap? On Thu, Apr 23, 2015 at 8:06 AM, Rok Roskar wrote: > the feature dimension is 800k. > > yes, I believe the driver memory is likely the problem since it doesn

Re: StandardScaler failing with OOM errors in PySpark

2015-04-28 Thread Rok Roskar
That's exactly what I'm saying -- I specify the memory options using spark options, but this is not reflected in how the JVM is created. No matter which memory settings I specify, the JVM for the driver is always made with 512Mb of memory. So I'm not sure if this is a feature or

PySpark, numpy arrays and binary data

2014-08-06 Thread Rok Roskar
ious about any other best-practices tips anyone might have for running pyspark with numpy data...! Thanks! Rok

Re: PySpark, numpy arrays and binary data

2014-08-07 Thread Rok Roskar
l by
> given descriptions.
>
> For examples, you can partition your data into (path, start, length), then
>
> partitions = [(path1, start1, length), (path1, start2, length), ...]
>
> def read_chunk(path, start, length):
>     f = open(path)
>     f.seek(start)
>
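
A filled-out sketch of the quoted approach, assuming a flat binary file of float64 values; the path, dtype, and chunk size are all made up:

import numpy as np
from pyspark import SparkContext

sc = SparkContext.getOrCreate()

path = "/tmp/data.bin"                    # hypothetical flat binary file of float64 values
chunk_bytes = 8 * 1024
n_chunks = 4
partitions = [(path, i * chunk_bytes, chunk_bytes) for i in range(n_chunks)]

def read_chunk(path, start, length):
    # read one byte range and decode it into a numpy array
    with open(path, "rb") as f:
        f.seek(start)
        yield np.frombuffer(f.read(length), dtype=np.float64)

arrays = sc.parallelize(partitions, len(partitions)).flatMap(lambda p: read_chunk(*p))
print(arrays.map(lambda a: a.sum()).collect())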

out of memory errors -- per core memory limits?

2014-08-21 Thread Rok Roskar
ng if Spark is setting the per-core memory limit somewhere? Thanks, Rok

repartitioning an RDD yielding imbalance

2014-08-28 Thread Rok Roskar
6: 3594, 7: 134} Why this strange redistribution of elements? I'm obviously misunderstanding how spark does the partitioning -- is it a problem with having a list of strings as an RDD? Help very much appreciated! Thanks, Rok
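
A quick diagnostic sketch for this kind of question: count the elements in each partition after the repartition (toy data in place of the real list of strings):

from pyspark import SparkContext

sc = SparkContext.getOrCreate()

rdd = sc.parallelize([str(i) for i in range(10000)]).repartition(8)

sizes = rdd.mapPartitionsWithIndex(lambda i, it: [(i, sum(1 for _ in it))]).collect()
print(dict(sizes))  # partition index -> element count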

Re: calculating the mean of SparseVector RDD

2015-01-09 Thread Rok Roskar
. -Xiangrui > > On Wed, Jan 7, 2015 at 9:42 AM, rok wrote: > > I have an RDD of SparseVectors and I'd like to calculate the means > returning > > a dense vector. I've tried doing this with the following (using pyspark, > > spark v1.2.0): > > > > def aggr

Re: calculating the mean of SparseVector RDD

2015-01-12 Thread Rok Roskar
i, Jan 9, 2015 at 3:46 AM, Rok Roskar wrote: > > thanks for the suggestion -- however, looks like this is even slower. > With > > the small data set I'm using, my aggregate function takes ~ 9 seconds and > > the colStats.mean() takes ~ 1 minute. However, I can't get i