Re: FetchFailedException and MetadataFetchFailedException

2015-05-22 Thread Rok Roskar
On the worker/container that fails, the "file not found" is the first error -- the output below is from the YARN log. There were some Python worker crashes for another job/stage earlier (see the warning at 18:36), but I expect those to be unrelated to this file-not-found error.

Re: FetchFailedException and MetadataFetchFailedException

2015-05-28 Thread Rok Roskar
> …ally finish, even though the tasks are throwing these errors while writing
> the map output? Or do you sometimes get failures on the shuffle write side,
> and sometimes on the shuffle read side? (Not that I think you are doing
> anything wrong, but it may help narrow down the root cause…

very slow parquet file write

2015-11-05 Thread Rok Roskar
I'm writing a ~100 GB pyspark DataFrame with a few hundred partitions into a parquet file on HDFS. I've got a few hundred nodes in the cluster, so for the size of file this is way over-provisioned (I've tried it with fewer partitions and fewer nodes, with no obvious effect). I was expecting the dump to…
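
For reference, a minimal sketch of the kind of write described here, assuming the pyspark shell's SparkContext `sc`; the input path and partition count are hypothetical, not taken from the thread:

    from pyspark.sql import SQLContext

    sqlContext = SQLContext(sc)
    # Hypothetical source; the thread's DataFrame is built elsewhere.
    df = sqlContext.read.json("hdfs:///tmp/input.json")
    # Repartition to a few hundred partitions and write to Parquet on HDFS.
    df.repartition(200).write.parquet("hdfs:///tmp/output.parquet")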

Re: very slow parquet file write

2015-11-06 Thread Rok Roskar
Yes, I was expecting that too because of all the metadata generation and compression. But I have not seen performance this bad for other parquet files I've written, and I was wondering if there could be something obvious (and wrong) in how I've specified the schema etc. It's a very simple schema…

Re: very slow parquet file write

2015-11-13 Thread Rok Roskar
I'm not sure what you mean? I didn't do anything specifically to partition the columns.

On Nov 14, 2015 00:38, "Davies Liu" wrote:
> Do you have partitioned columns?
>
> On Thu, Nov 5, 2015 at 2:08 AM, Rok Roskar wrote:
> > I'm writing a ~100 Gb pyspark…
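
The "partitioned columns" in the quoted question refer to Parquet partitioning by column values. A minimal sketch of the difference, with a hypothetical `date` column and paths (assumes the pyspark shell's `sc`):

    from pyspark.sql import SQLContext, Row

    sqlContext = SQLContext(sc)
    df = sqlContext.createDataFrame([Row(date="2015-11-05", value=1.0),
                                     Row(date="2015-11-06", value=2.0)])

    # Plain write: no partition columns, a single directory of part files.
    df.write.parquet("hdfs:///tmp/flat.parquet")

    # Partitioned write: one subdirectory per distinct value of `date`.
    df.write.partitionBy("date").parquet("hdfs:///tmp/partitioned.parquet")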

Re: problems running Spark on a firewalled remote YARN cluster via SOCKS proxy

2015-07-24 Thread Rok Roskar
Hi Akhil, the namenode is definitely configured correctly, otherwise the job would not start at all. It registers with YARN and starts up, but once the nodes try to communicate with each other it fails. Note that a Hadoop MR job using the identical configuration executes without any problems. The dr…

Re: NegativeArraySizeException in pyspark when loading an RDD pickleFile

2015-01-28 Thread Rok Roskar
Hi, thanks for the quick answer -- I suppose this is possible, though I don't understand how it could come about. The largest individual RDD elements are ~1 MB in size (most are smaller) and the RDD is composed of 800k of them. The file is saved in 134 parts, but is being read in using some 1916+…

Re: NegativeArraySizeException in pyspark when loading an RDD pickleFile

2015-01-29 Thread Rok Roskar
> …chance that one object or one batch will be bigger than 2G.
> Maybe there is a bug when it splits the pickled file; could you create
> an RDD for each file, then see which file causes the issue (maybe some of them)?
>
> On Wed, Jan 28, 2015 at 1:30 AM, Rok Roskar wrote:
> > …
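
A sketch of the per-file check suggested in the quote, assuming the pyspark shell's `sc`; the path is hypothetical, and the part count (134) comes from the earlier message:

    base = "hdfs:///user/rok/data.pickle"   # hypothetical path to the saved pickleFile
    for i in range(134):
        part = "%s/part-%05d" % (base, i)
        try:
            sc.pickleFile(part).count()     # force a full read of this part only
        except Exception as e:
            print("failed on %s: %s" % (part, e))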

Re: pyspark: Java null pointer exception when accessing broadcast variables

2015-02-10 Thread Rok Roskar
I get this in the driver log:

    java.lang.NullPointerException
        at org.apache.spark.api.python.PythonRDD$.writeUTF(PythonRDD.scala:590)
        at org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1$$anonfun$apply$mcV$sp$3.apply(PythonRDD.scala:233)
        at org.apache.spark…
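
For context, the basic broadcast pattern the thread is about, as a minimal sketch with a hypothetical lookup table (assumes the pyspark shell's `sc`):

    lookup = dict((i, i * 2) for i in range(1000000))   # hypothetical large lookup table
    lookup_bc = sc.broadcast(lookup)

    rdd = sc.parallelize(range(1000000))
    # Tasks dereference the broadcast with .value; the stack trace above comes from
    # the JVM writer thread that feeds data to the Python workers.
    print(rdd.map(lambda k: lookup_bc.value[k]).take(5))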

Re: pyspark: Java null pointer exception when accessing broadcast variables

2015-02-10 Thread Rok Roskar
> …how much memory do you have in the executor and driver?
> Do you see any other exceptions in the driver and executors? Something
> related to serialization in the JVM.
>
> On Tue, Feb 10, 2015 at 2:16 PM, Rok Roskar wrote:
> > I get this in the driver log:
> > I think this shoul…

Re: iteratively modifying an RDD

2015-02-11 Thread Rok Roskar
Yes, I actually do use mapPartitions already.

On Feb 11, 2015 7:55 PM, "Charles Feduke" wrote:
> If you use mapPartitions to iterate the lookup_tables, does that improve
> the performance?
>
> This link is to the Spark 1.1 docs because both latest and 1.2 for Python
> give me a 404: http://spark.apach…
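
A minimal sketch of the mapPartitions pattern being discussed, with a hypothetical lookup table (assumes the pyspark shell's `sc`):

    lookup_bc = sc.broadcast(dict((i, i * 2) for i in range(100000)))
    rdd = sc.parallelize(range(100000))

    def apply_lookup(iterator):
        lookup = lookup_bc.value          # dereference the broadcast once per partition
        for key in iterator:
            yield lookup.get(key)

    print(rdd.mapPartitions(apply_lookup).take(5))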

Re: pyspark: Java null pointer exception when accessing broadcast variables

2015-02-11 Thread Rok Roskar
…015, 19:59 Davies Liu wrote:
> Could you share a short script to reproduce this problem?
>
> On Tue, Feb 10, 2015 at 8:55 PM, Rok Roskar wrote:
> > I didn't notice other errors -- I also thought such a large broadcast is a
> > bad idea but I tried something sim…

Re: iteratively modifying an RDD

2015-02-11 Thread Rok Roskar
Aha, great! Thanks for the clarification!

On Feb 11, 2015 8:11 PM, "Davies Liu" wrote:
> On Wed, Feb 11, 2015 at 10:47 AM, rok wrote:
> > I was having trouble with memory exceptions when broadcasting a large lookup
> > table, so I've resorted to processing it iteratively -- but how can I modi…

Re: iteratively modifying an RDD

2015-02-11 Thread Rok Roskar
The runtime for each consecutive iteration is still roughly twice as long as for the previous one -- is there a way to reduce whatever overhead is accumulating?

On Feb 11, 2015, at 8:11 PM, Davies Liu wrote:
> On Wed, Feb 11, 2015 at 10:47 AM, rok wrote:
>> I was having trouble with memory e…

Re: iteratively modifying an RDD

2015-02-11 Thread Rok Roskar
…utive map is slower than the one before. I'll try the checkpoint, thanks for the suggestion.

On Feb 12, 2015, at 12:13 AM, Davies Liu wrote:
> On Wed, Feb 11, 2015 at 2:43 PM, Rok Roskar wrote:
>> the runtime for each consecutive iteration is still roughly twice as long as…
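
A sketch of the checkpoint suggestion, which truncates the lineage so each iteration does not drag along the DAG of all previous ones (assumes the pyspark shell's `sc`; the update function, interval, and checkpoint directory are hypothetical):

    sc.setCheckpointDir("hdfs:///tmp/spark-checkpoints")

    rdd = sc.parallelize(range(1000000))
    for i in range(10):
        rdd = rdd.map(lambda x: x + 1)    # stand-in for the real per-iteration update
        if i % 3 == 0:
            rdd.checkpoint()              # cut the lineage every few iterations
            rdd.count()                   # an action forces the checkpoint to materialize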

Re: pyspark: Java null pointer exception when accessing broadcast variables

2015-02-12 Thread Rok Roskar
…6163, 0.38274814840717364, 0.6606453820150496, 0.8610156719813942, 0.6971353266345091, 0.9896836700210551, 0.05789392881996358]

Is there a size limit for objects serialized with Kryo? Or an option that controls it? The Java serializer works fine.

On Wed, Feb 11, 2015 at 8:04 PM, Rok Roskar w…
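
There is a configurable Kryo buffer limit; a sketch of raising it, noting that the key name depends on the Spark version (spark.kryoserializer.buffer.max.mb in older releases, spark.kryoserializer.buffer.max with a size suffix in newer ones):

    from pyspark import SparkConf, SparkContext

    conf = (SparkConf()
            .setAppName("kryo-buffer-example")
            .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
            # Spark 1.2/1.3 key; on 1.4+ use spark.kryoserializer.buffer.max=512m instead.
            .set("spark.kryoserializer.buffer.max.mb", "512"))
    sc = SparkContext(conf=conf)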

Re: java.util.NoSuchElementException: key not found:

2015-03-02 Thread Rok Roskar
Aha, OK, thanks. If I create different RDDs from a parent RDD and force evaluation thread-by-thread, then it should presumably be fine, correct? Or do I need to checkpoint the child RDDs as a precaution, in case it needs to be removed from memory and recomputed?

On Sat, Feb 28, 2015 at 4:28 AM, Shi…
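
A sketch of the parent/child RDD situation being asked about, caching the parent so that re-evaluating a child does not recompute it from scratch (assumes the pyspark shell's `sc`; the filters are hypothetical):

    parent = sc.parallelize(range(1000000)).cache()

    evens = parent.filter(lambda x: x % 2 == 0)
    odds = parent.filter(lambda x: x % 2 == 1)

    # Forcing evaluation of each child; the cached parent is reused rather than rebuilt.
    print(evens.count(), odds.count())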

Re: StandardScaler failing with OOM errors in PySpark

2015-04-22 Thread Rok Roskar
The feature dimension is 800k. Yes, I believe the driver memory is likely the problem, since it doesn't crash until the very last part of the tree aggregation. I'm running it via pyspark through YARN -- I have to run in client mode, so I can't set spark.driver.memory -- I've tried setting the sp…
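
For reference, a minimal sketch of the StandardScaler call in question, on a tiny dense dataset rather than the ~800k-dimensional one from the thread (assumes the pyspark shell's `sc`):

    from pyspark.mllib.feature import StandardScaler
    from pyspark.mllib.linalg import Vectors

    data = sc.parallelize([Vectors.dense([1.0, 2.0, 3.0]),
                           Vectors.dense([4.0, 5.0, 6.0])])
    # fit() runs the aggregation whose final step lands on the driver.
    scaler = StandardScaler(withMean=True, withStd=True).fit(data)
    print(scaler.transform(data).collect())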

Re: StandardScaler failing with OOM errors in PySpark

2015-04-23 Thread Rok Roskar
…B overhead. Is there some reason why these options are being ignored and the driver instead starts with just 512 MB of heap?

On Thu, Apr 23, 2015 at 8:06 AM, Rok Roskar wrote:
> the feature dimension is 800k.
>
> yes, I believe the driver memory is likely the problem since it doesn…

Re: StandardScaler failing with OOM errors in PySpark

2015-04-28 Thread Rok Roskar
…a bug?

rok

On Mon, Apr 27, 2015 at 6:54 PM, Xiangrui Meng wrote:
> You might need to specify driver memory in spark-submit instead of
> passing JVM options. spark-submit is designed to handle different
> deployments correctly. -Xiangrui
>
> On Thu, Apr 23, 2015 at 4:58 AM, Rok Roskar…

PySpark, numpy arrays and binary data

2014-08-06 Thread Rok Roskar
Hello, I'm interested in getting started with Spark to scale our scientific analysis package (http://pynbody.github.io) to larger data sets. The package is written in Python and makes heavy use of numpy/scipy and related frameworks. I've got a couple of questions that I have not been able to fi…
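
One way to get fixed-length binary records into numpy on the workers, as a sketch; it assumes Spark >= 1.2 (where sc.binaryRecords is available), the pyspark shell's `sc`, and a hypothetical file of float64 triples:

    import numpy as np

    record_len = 3 * 8                      # three float64 values per record (hypothetical layout)
    raw = sc.binaryRecords("hdfs:///tmp/particles.bin", record_len)
    arrays = raw.map(lambda rec: np.frombuffer(rec, dtype=np.float64))
    print(arrays.first())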

Re: PySpark, numpy arrays and binary data

2014-08-07 Thread Rok Roskar
Thanks for the quick answer!

> numpy array only can support basic types, so we can not use it during
> collect() by default.

Sure, but if you knew that a numpy array went in on one end, you could safely use it on the other end, no? Perhaps it would require an extension of the RDD class an…

out of memory errors -- per core memory limits?

2014-08-21 Thread Rok Roskar
I am having some issues with processes running out of memory, and I'm wondering if I'm setting things up incorrectly. I am running a job on two nodes with 24 cores and 256 GB of memory each. I start the pyspark shell with SPARK_EXECUTOR_MEMORY=210gb. When I run the job with anything more than 8…

repartitioning an RDD yielding imbalance

2014-08-28 Thread Rok Roskar
I've got an RDD where each element is a long string (a whole document). I'm using pyspark, so some of the handy partition-handling functions aren't available, and I count the number of elements in each partition with:

    def count_partitions(id, iterator):
        c = sum(1 for _ in iterator)
        yiel…
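
A runnable version of the truncated snippet above, using mapPartitionsWithIndex (assumes the pyspark shell's `sc`; the RDD here is a small stand-in):

    def count_partitions(id, iterator):
        c = sum(1 for _ in iterator)
        yield (id, c)                     # (partition index, element count)

    rdd = sc.parallelize(range(1000), 8)
    print(rdd.mapPartitionsWithIndex(count_partitions).collect())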

Re: calculating the mean of SparseVector RDD

2015-01-09 Thread Rok Roskar
Thanks for the suggestion -- however, it looks like this is even slower. With the small data set I'm using, my aggregate function takes ~9 seconds and colStats.mean() takes ~1 minute. However, I can't get it to run with the Kryo serializer -- I get the error: com.esotericsoftware.kryo.KryoExcep…
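
For reference, a sketch of the colStats-based mean on a tiny SparseVector RDD (assumes the pyspark shell's `sc`; the real data is much larger):

    from pyspark.mllib.linalg import Vectors
    from pyspark.mllib.stat import Statistics

    rdd = sc.parallelize([Vectors.sparse(5, {0: 1.0, 3: 2.0}),
                          Vectors.sparse(5, {1: 3.0})])
    summary = Statistics.colStats(rdd)
    print(summary.mean())                 # per-column means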

Re: calculating the mean of SparseVector RDD

2015-01-12 Thread Rok Roskar
…Fri, Jan 9, 2015 at 3:46 AM, Rok Roskar wrote:
> > thanks for the suggestion -- however, looks like this is even slower. With
> > the small data set I'm using, my aggregate function takes ~ 9 seconds and
> > the colStats.mean() takes ~ 1 minute. However, I can't get i…