BTW, my spark.python.worker.reuse setting is set to "true".
--
So I'm running PySpark 1.3.1 on Amazon EMR on a fairly beefy cluster (20 nodes, each with 32 cores and 64 GB of memory), and my parallelism, executor.instances, executor.cores, and executor memory settings are also fairly reasonable (600, 20, 30, and 48 GB respectively).
However, my job invariably fails when it hits a mid-sized broadcast.
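
For reference, a minimal sketch of how those settings map onto Spark configuration properties, assuming they're applied programmatically via SparkConf (in practice they could just as well be spark-submit flags); the values are the ones quoted above, plus the spark.python.worker.reuse setting mentioned earlier in the thread:

from pyspark import SparkConf, SparkContext

# Values taken from the description above; property names are the
# standard Spark configuration keys.
conf = (SparkConf()
        .set("spark.default.parallelism", "600")
        .set("spark.executor.instances", "20")
        .set("spark.executor.cores", "30")
        .set("spark.executor.memory", "48g")
        .set("spark.python.worker.reuse", "true"))
sc = SparkContext(conf=conf)
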
Hi,
So my Spark app needs to run a sliding window through a time series dataset (I'm not using Spark Streaming) and then run different types of aggregations on a per-window basis. Right now I'm using a groupByKey(), which gives me an Iterable for each window. There are a few concerns I have with this approach.
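
To make the approach concrete, here is a minimal sketch of the groupByKey() pattern described above; the window size, slide step, toy records, and the sum aggregation are all placeholder assumptions for illustration:

from pyspark import SparkContext

sc = SparkContext(appName="sliding-window-sketch")

# Toy (timestamp_in_seconds, value) records standing in for the time series.
rdd = sc.parallelize([(0, 1.0), (25, 2.0), (70, 3.0), (95, 4.0)])

window_size = 60  # each window covers [start, start + window_size)
step = 30         # windows start every `step` seconds, so they overlap

def assign_windows(record):
    ts, value = record
    # Emit (window_start, record) for every window that contains this timestamp.
    first = max(0, (ts - window_size) // step + 1)
    for w in range(first, ts // step + 1):
        yield (w * step, (ts, value))

# groupByKey() yields one Iterable of records per window; aggregate each window.
per_window = rdd.flatMap(assign_windows).groupByKey()
sums = per_window.mapValues(lambda recs: sum(v for _, v in recs))
print(sums.collect())
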
How do I set up HADOOP_CONF_DIR correctly when I'm running my Spark job on YARN? My YARN environment has the correct HADOOP_CONF_DIR setting, but the configuration that I pull from sc.hadoopConfiguration() is incorrect.
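
For context, a minimal sketch of the kind of check described above, assuming PySpark, where sc._jsc.hadoopConfiguration() is the internal Py4J route to the same Hadoop Configuration object (the two property names are just examples):

from pyspark import SparkContext

sc = SparkContext(appName="hadoop-conf-check")

# sc._jsc is PySpark's internal JavaSparkContext handle; hadoopConfiguration()
# returns the org.apache.hadoop.conf.Configuration the driver is using.
# HADOOP_CONF_DIR itself is normally exported in the shell (or spark-env.sh)
# before spark-submit; this only shows what the driver ended up with.
hadoop_conf = sc._jsc.hadoopConfiguration()
print(hadoop_conf.get("fs.defaultFS"))
print(hadoop_conf.get("yarn.resourcemanager.hostname"))
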
--
Apologies if this is something very obvious, but I've perused the Spark Streaming guide and this still isn't very evident to me. So I have files with data of the format: timestamp,column1,column2,column3.. etc., and I'd like to use Spark Streaming's window operations on them.
However from what I
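
In case it helps frame the question, here is a minimal sketch of applying window operations to files landing in a directory, assuming PySpark's DStream API. The path, batch interval, window/slide durations, and the count() aggregation are illustrative placeholders; note that DStream windows are defined over batch arrival time, not the timestamp column inside the files.

from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="csv-window-sketch")
ssc = StreamingContext(sc, batchDuration=10)  # 10-second micro-batches

# Watch a directory for new files; each line is "timestamp,column1,column2,..."
lines = ssc.textFileStream("hdfs:///path/to/incoming")  # hypothetical path
rows = lines.map(lambda line: line.split(","))

# 60-second window sliding every 20 seconds (both multiples of the batch interval).
windowed = rows.window(windowDuration=60, slideDuration=20)
windowed.count().pprint()

ssc.start()
ssc.awaitTermination()
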