Hi list, I'm having trouble running a PCA with PySpark. I'm trying to reduce the dimensionality of my feature matrix, since after one-hot encoding (OHE) my feature vectors end up about 40k wide.
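For context, here's a minimal sketch of the pipeline I'm running. The DataFrame df, the column names, and k=50 are placeholders, not the exact job; the real job one-hot encodes several already string-indexed categorical columns:

    from pyspark.ml.feature import OneHotEncoder, VectorAssembler, PCA

    # df, column names and k are illustrative placeholders.
    # The assembled OHE output ends up ~40k wide in my real job.
    encoder = OneHotEncoder(inputCol="category_idx", outputCol="category_vec")
    encoded = encoder.transform(df)

    assembler = VectorAssembler(inputCols=["category_vec"], outputCol="features")
    assembled = assembler.transform(encoded)

    pca = PCA(k=50, inputCol="features", outputCol="pca_features")
    model = pca.fit(assembled)  # <- this is the call that blows up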
My environment:

- Spark 2.2.0 stand-alone (Oracle JVM)
- pyspark 2.2.0 from a Docker image (OpenJDK)

I'm starting the Spark session from the notebook, but I make sure to set:

- PYSPARK_SUBMIT_ARGS: "--packages ... --driver-memory 20G pyspark-shell"
- sparkConf.set("spark.executor.memory", "24G")
- sparkConf.set("spark.driver.memory", "20G")

My executors get 24 GB each per node, and my driver process starts with:

    /usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java -cp /usr/local/spark/conf/:/usr/local/spark/jars/* -Xmx20G org.apache.spark.deploy.SparkSubmit --conf spark.executor.memory=24G ... pyspark-shell

So I should have plenty of memory to play with. However, when running PCA.fit I see this in the Spark driver logs:

    19/02/08 01:02:43 WARN TaskSetManager: Stage 29 contains a task of very large size (142 KB). The maximum recommended task size is 100 KB.
    19/02/08 01:02:43 WARN RowMatrix: 34771 columns will require at least 9672 megabytes of memory!
    19/02/08 01:02:46 WARN RowMatrix: 34771 columns will require at least 9672 megabytes of memory!

Eventually it fails with:

    Py4JJavaError: An error occurred while calling o287.fit.
    : java.lang.OutOfMemoryError
        at java.io.ByteArrayOutputStream.hugeCapacity(ByteArrayOutputStream.java:123)
        at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:117)
        at java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
        at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:153)
        at org.apache.spark.util.ByteBufferOutputStream.write(ByteBufferOutputStream.scala:41)
        ...
        at org.apache.spark.mllib.linalg.distributed.RowMatrix.computeGramianMatrix(RowMatrix.scala:122)
        at org.apache.spark.mllib.linalg.distributed.RowMatrix.computeCovariance(RowMatrix.scala:344)
        at org.apache.spark.mllib.linalg.distributed.RowMatrix.computePrincipalComponentsAndExplainedVariance(RowMatrix.scala:387)
        at org.apache.spark.mllib.feature.PCA.fit(PCA.scala:48)
        ...

What am I missing? Any hints much appreciated,
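P.S. In case it helps, this is roughly how I build the session from inside the notebook. This is a sketch: the real --packages list is elided exactly as above, and the master URL is a placeholder for my stand-alone master:

    import os
    from pyspark import SparkConf
    from pyspark.sql import SparkSession

    # Must be set before the JVM starts; packages list elided as above.
    os.environ["PYSPARK_SUBMIT_ARGS"] = "--packages ... --driver-memory 20G pyspark-shell"

    sparkConf = SparkConf()
    sparkConf.set("spark.executor.memory", "24G")
    sparkConf.set("spark.driver.memory", "20G")

    spark = (SparkSession.builder
             .master("spark://master:7077")  # placeholder stand-alone master URL
             .config(conf=sparkConf)
             .getOrCreate())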