On the worker/container that fails, the "file not found" is the first error
-- the output below is from the YARN log. There were some Python worker
crashes for another job/stage earlier (see the warning at 18:36), but I
expect those to be unrelated to this file-not-found error.
> ally finish, even though the tasks are throwing these errors while
> writing the map output? Or do you sometimes get failures on the shuffle
> write side, and sometimes on the shuffle read side? (Not that I think you
> are doing anything wrong, but it may help narrow down the root cause.)
I'm writing a ~100 GB pyspark DataFrame with a few hundred partitions into
a parquet file on HDFS. I've got a few hundred nodes in the cluster, so for
a file of this size this is way over-provisioned (I've tried it with fewer
partitions and fewer nodes, with no obvious effect). I was expecting the dump to
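For reference, a minimal sketch of the kind of write described above (the input data, partition count, and HDFS output path are placeholders, and it uses the newer SparkSession entry point for brevity):

from pyspark.sql import SparkSession

# Hypothetical setup; the real job builds its DataFrame elsewhere.
spark = SparkSession.builder.appName("parquet-dump").getOrCreate()
df = spark.range(10**8)   # placeholder DataFrame standing in for the real data

# Repartition to a few hundred partitions and dump to Parquet on HDFS.
df.repartition(400).write.mode("overwrite").parquet("hdfs:///tmp/output.parquet")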
Yes, I was expecting that too because of all the metadata generation and
compression. But I have not seen performance this bad for other parquet
files I've written, and was wondering if there could be something obvious
(and wrong) about how I've specified the schema etc. It's a very
simple schema
I'm not sure what you mean? I didn't do anything specifically to partition
the columns
On Nov 14, 2015 00:38, "Davies Liu" wrote:
> Do you have partitioned columns?
>
> On Thu, Nov 5, 2015 at 2:08 AM, Rok Roskar wrote:
> > I'm writing a ~100 Gb pyspark
Hi Akhil,
The namenode is definitely configured correctly, otherwise the job would
not start at all. It registers with YARN and starts up, but once the nodes
try to communicate with each other it fails. Note that a Hadoop MR job using
the identical configuration executes without any problems. The dr
Hi, thanks for the quick answer -- I suppose this is possible, though I
don't understand how it could come about. The largest individual RDD
elements are ~1 MB in size (most are smaller) and the RDD is composed of
800k of them. The file is saved in 134 parts, but is being read in using
some 1916+
> chance that one object or one batch will be bigger than 2G.
> Maybe there is a bug when it splits the pickled file -- could you create
> an RDD for each file, then see which file is causing the issue (maybe
> some of them)?
>
> On Wed, Jan 28, 2015 at 1:30 AM, Rok Roskar wrote:
>
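A minimal sketch of that suggestion for narrowing down which input file triggers the problem, assuming the data is a directory of pickled part files (the paths and part count are placeholders):

from pyspark import SparkContext

sc = SparkContext(appName="find-bad-part")

# Hypothetical list of the individual part files; adjust to the real layout.
parts = ["hdfs:///data/rdd.pickle/part-%05d" % i for i in range(134)]

for path in parts:
    try:
        # Load one file at a time and force full evaluation with count().
        print(path, sc.pickleFile(path).count())
    except Exception as e:
        # Any file that fails here is a candidate for the >2G batch issue.
        print("FAILED:", path, e)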
I get this in the driver log:
java.lang.NullPointerException
    at org.apache.spark.api.python.PythonRDD$.writeUTF(PythonRDD.scala:590)
    at org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1$$anonfun$apply$mcV$sp$3.apply(PythonRDD.scala:233)
    at org.apache.spark
> How much memory do you have in the executor and driver?
> Do you see any other exceptions in the driver and executors? Something
> related to serialization in the JVM.
>
> On Tue, Feb 10, 2015 at 2:16 PM, Rok Roskar wrote:
> > I get this in the driver log:
>
> I think this shoul
Yes, I actually do use mapPartitions already.
On Feb 11, 2015 7:55 PM, "Charles Feduke" wrote:
> If you use mapPartitions to iterate the lookup_tables does that improve
> the performance?
>
> This link is to Spark docs 1.1 because both latest and 1.2 for Python give
> me a 404:
> http://spark.apach
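For context, a minimal sketch of the mapPartitions pattern being discussed (the lookup table, the RDD, and the per-record logic are hypothetical stand-ins):

from pyspark import SparkContext

sc = SparkContext(appName="mappartitions-lookup")

lookup_table = {"a": 1, "b": 2}          # hypothetical lookup table
bc = sc.broadcast(lookup_table)          # shipped to the executors once
rdd = sc.parallelize(["a", "b", "a", "c"])

def lookup_partition(records):
    # The broadcast value is fetched once per partition, not once per record.
    table = bc.value
    for r in records:
        yield table.get(r, -1)

print(rdd.mapPartitions(lookup_partition).collect())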
015, 19:59 Davies Liu wrote:
> Could you share a short script to reproduce this problem?
>
> On Tue, Feb 10, 2015 at 8:55 PM, Rok Roskar wrote:
> > I didn't notice other errors -- I also thought such a large broadcast is
> > a bad idea but I tried something sim
Aha great! Thanks for the clarification!
On Feb 11, 2015 8:11 PM, "Davies Liu" wrote:
> On Wed, Feb 11, 2015 at 10:47 AM, rok wrote:
> > I was having trouble with memory exceptions when broadcasting a large
> > lookup table, so I've resorted to processing it iteratively -- but how
> > can I modi
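A minimal sketch of the iterative approach being described, assuming the lookup table can be split into chunks that are broadcast and released one at a time (the chunking and filter logic are placeholders):

from pyspark import SparkContext

sc = SparkContext(appName="chunked-broadcast")

rdd = sc.parallelize([("k3", 10), ("k70", 20)])     # hypothetical data
lookup_table = {"k%d" % i: i for i in range(100)}   # hypothetical large table

def chunks(d, size):
    items = list(d.items())
    for i in range(0, len(items), size):
        yield dict(items[i:i + size])

results = []
for chunk in chunks(lookup_table, 25):
    bc = sc.broadcast(chunk)
    # Process against the current chunk only, then release it.
    results.extend(rdd.filter(lambda kv: kv[0] in bc.value).collect())
    bc.unpersist()
print(results)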
The runtime for each consecutive iteration is still roughly twice as long
as for the previous one -- is there a way to reduce whatever overhead is
accumulating?
On Feb 11, 2015, at 8:11 PM, Davies Liu wrote:
> On Wed, Feb 11, 2015 at 10:47 AM, rok wrote:
>> I was having trouble with memory e
utive map is slower than the one
before. I'll try the checkpoint, thanks for the suggestion.
On Feb 12, 2015, at 12:13 AM, Davies Liu wrote:
> On Wed, Feb 11, 2015 at 2:43 PM, Rok Roskar wrote:
>> the runtime for each consecutive iteration is still roughly twice as long as
>
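A minimal sketch of the checkpointing idea for an iterative job, so that each pass does not drag along the full lineage of all previous transformations (the per-iteration work and the checkpoint interval are placeholders):

from pyspark import SparkContext

sc = SparkContext(appName="checkpoint-lineage")
sc.setCheckpointDir("hdfs:///tmp/checkpoints")   # placeholder path

rdd = sc.parallelize(range(1000))
for i in range(20):
    rdd = rdd.map(lambda x: x + 1)   # one iteration of hypothetical work
    if i % 5 == 4:
        rdd.cache()
        rdd.checkpoint()
        rdd.count()   # force materialization so the checkpoint actually happens
print(rdd.count())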
6163, 0.38274814840717364, 0.6606453820150496, 0.8610156719813942, 0.6971353266345091, 0.9896836700210551, 0.05789392881996358]
Is there a size limit for objects serialized with Kryo? Or an option that
controls it? The Java serializer works fine.
On Wed, Feb 11, 2015 at 8:04 PM, Rok Roskar w
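Kryo does have a configurable buffer cap; a minimal sketch of raising it (the property name depends on the Spark version: spark.kryoserializer.buffer.max.mb in older 1.x releases, spark.kryoserializer.buffer.max later, and 512m is just an example value):

from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setAppName("kryo-buffer")
        .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
        # Objects larger than this buffer fail to serialize with Kryo.
        .set("spark.kryoserializer.buffer.max", "512m"))
sc = SparkContext(conf=conf)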
Aha, OK, thanks.
If I create different RDDs from a parent RDD and force evaluation
thread-by-thread, then it should presumably be fine, correct? Or do I need
to checkpoint the child RDDs as a precaution in case it needs to be removed
from memory and recomputed?
On Sat, Feb 28, 2015 at 4:28 AM, Shi
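A minimal sketch of the pattern being asked about: persist the parent before deriving and evaluating the children, so that a recompute starts from the persisted data rather than the full lineage (the transformations are placeholders):

from pyspark import SparkContext, StorageLevel

sc = SparkContext(appName="parent-child-rdds")

parent = sc.parallelize(range(10000)).map(lambda x: x * 2)
# Spill to disk rather than recompute if partitions don't fit in memory.
parent.persist(StorageLevel.MEMORY_AND_DISK)

child_a = parent.filter(lambda x: x % 3 == 0)
child_b = parent.filter(lambda x: x % 5 == 0)
print(child_a.count(), child_b.count())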
The feature dimension is 800k.
Yes, I believe the driver memory is likely the problem since it doesn't crash
until the very last part of the tree aggregation.
I'm running it via pyspark through YARN -- I have to run in client mode, so I
can't set spark.driver.memory -- I've tried setting the sp
B overhead
Is there some reason why these options are being ignored and the driver is
instead being started with just 512 MB of heap?
On Thu, Apr 23, 2015 at 8:06 AM, Rok Roskar wrote:
> the feature dimension is 800k.
>
> yes, I believe the driver memory is likely the problem since it doesn
a bug?
rok
On Mon, Apr 27, 2015 at 6:54 PM, Xiangrui Meng wrote:
> You might need to specify driver memory in spark-submit instead of
> passing JVM options. spark-submit is designed to handle different
> deployments correctly. -Xiangrui
>
> On Thu, Apr 23, 2015 at 4:58 AM, Rok Roskar
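A minimal example of what Xiangrui is suggesting: set the driver heap on the spark-submit command line (or in spark-defaults.conf), since in client mode the driver JVM is already running by the time application-level SparkConf settings are read. The memory size and script name are placeholders:

spark-submit --master yarn --deploy-mode client \
  --driver-memory 8g \
  my_job.py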
Hello,
I'm interested in getting started with Spark to scale our scientific analysis
package (http://pynbody.github.io) to larger data sets. The package is written
in Python and makes heavy use of numpy/scipy and related frameworks. I've got a
couple of questions that I have not been able to fi
Thanks for the quick answer!
> numpy arrays can only support basic types, so we cannot use them during
> collect() by default.
>
Sure, but if you knew that a numpy array went in on one end, you could safely
use it on the other end, no? Perhaps it would require an extension of the RDD
class an
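A small sketch of what "using it on the other end" can look like in user code today, without extending the RDD class: collect() hands back an ordinary Python list, and if you know numpy arrays went in, you can stack them back into a single array on the driver (the data here is hypothetical):

import numpy as np
from pyspark import SparkContext

sc = SparkContext(appName="numpy-collect")

# Hypothetical RDD whose elements are numpy rows of a numerical dataset.
rdd = sc.parallelize(range(8)).map(lambda i: np.arange(3, dtype=np.float64) + i)

rows = rdd.collect()        # a plain Python list of numpy arrays
matrix = np.vstack(rows)    # reassemble into a single array on the driver
print(matrix.shape)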
I am having some issues with processes running out of memory and I'm wondering
if I'm setting things up incorrectly.
I am running a job on two nodes with 24 cores and 256 GB of memory each. I start
the pyspark shell with SPARK_EXECUTOR_MEMORY=210gb. When I run the job with
anything more than 8
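For reference, a minimal sketch of setting executor resources through SparkConf properties instead of the environment variable (the sizes are placeholders; leaving some headroom below the node's physical RAM for the OS and per-executor overhead is generally advisable):

from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setAppName("memory-config")
        # Placeholder sizes: keep the executor heap below the node's 256 GB
        # so there is room for overhead and the OS.
        .set("spark.executor.memory", "200g")
        .set("spark.executor.cores", "24"))
sc = SparkContext(conf=conf)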
I've got an RDD where each element is a long string (a whole document). I'm
using pyspark so some of the handy partition-handling functions aren't
available, and I count the number of elements in each partition with:
def count_partitions(id, iterator):
    c = sum(1 for _ in iterator)
    yield id, c
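For completeness, a small usage sketch of that helper with mapPartitionsWithIndex (the example RDD is a placeholder):

from pyspark import SparkContext

sc = SparkContext(appName="partition-counts")
rdd = sc.parallelize(range(100), 8)   # hypothetical RDD with 8 partitions

def count_partitions(id, iterator):
    c = sum(1 for _ in iterator)
    yield id, c

# Returns a list of (partition_index, element_count) pairs.
print(rdd.mapPartitionsWithIndex(count_partitions).collect())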
Thanks for the suggestion -- however, it looks like this is even slower. With
the small data set I'm using, my aggregate function takes ~9 seconds and
the colStats.mean() takes ~1 minute. However, I can't get it to run with
the Kryo serializer -- I get the error:
com.esotericsoftware.kryo.KryoExcep
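For reference, a minimal sketch of the colStats approach being compared against here (the data is a placeholder):

import numpy as np
from pyspark import SparkContext
from pyspark.mllib.stat import Statistics

sc = SparkContext(appName="colstats-mean")

# Hypothetical RDD of dense feature vectors.
rdd = sc.parallelize([np.array([1.0, 2.0, 3.0]),
                      np.array([4.0, 5.0, 6.0])])

summary = Statistics.colStats(rdd)   # column-wise summary statistics
print(summary.mean())                # per-column means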
> On Fri, Jan 9, 2015 at 3:46 AM, Rok Roskar wrote:
> > thanks for the suggestion -- however, looks like this is even slower. With
> > the small data set I'm using, my aggregate function takes ~ 9 seconds and
> > the colStats.mean() takes ~ 1 minute. However, I can't get i