A rough estimate of the worst-case memory requirement for the driver is
about 2 * k * runs * numFeatures * numPartitions * 8 bytes. I put the 2 at
the front because the previous centers are still in memory while the driver
receives new center updates. -Xiangrui
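(For a concrete sense of scale, here is that formula plugged in with the numbers from this thread; runs = 1 and 10 partitions are assumptions for illustration, not values confirmed by the original poster:)

# Rough worst-case driver memory from the formula above.
# runs = 1 and numPartitions = 10 are assumptions, not from the thread.
k, runs, numFeatures, numPartitions = 1500, 1, 80000, 10
bytes_needed = 2 * k * runs * numFeatures * numPartitions * 8
print(bytes_needed / 1024.0 ** 3)  # roughly 17.9 GB for k = 1500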
On Fri, Jun 19, 2015 at 9:02 AM, Rogers Jeffrey
wrote:
Thanks. Setting the driver memory property worked for K=1000. But when I
increased K to 1500 I get the following error:
15/06/19 09:38:44 INFO ContextCleaner: Cleaned accumulator 7
15/06/19 09:38:44 INFO BlockManagerInfo: Removed broadcast_34_piece0 on
172.31.3.51:45157 in memory (size: 1568.0
I am submitting the application from a Python notebook. I am launching
pyspark as follows:
SPARK_PUBLIC_DNS=ec2-54-165-202-17.compute-1.amazonaws.com
SPARK_WORKER_CORES=8 SPARK_WORKER_MEMORY=15g SPARK_MEM=30g OUR_JAVA_MEM=30g
SPARK_DAEMON_JAVA_OPTS="-XX:MaxPermSize=30g -Xms30g -Xmx30g" IPYTHON=1
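(Side note, in case it helps: as far as I know SPARK_DAEMON_JAVA_OPTS only sizes the standalone master/worker daemons, not the driver JVM. One way to give the driver more heap when launching the pyspark shell is the --driver-memory flag, as in the sketch below; the 20g figure is only an illustration, not a tuned value:)

IPYTHON=1 ./bin/pyspark --driver-memory 20g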
With 80,000 features and 1000 clusters, you need 80,000,000 doubles to
store the cluster centers. That is ~600MB. If there are 10 partitions,
you might need 6GB on the driver to collect updates from workers. I
guess the driver died. Did you specify driver memory with
spark-submit? -Xiangrui
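(For reference, a minimal sketch of what that flag looks like on the command line; the script name and the 20g value are placeholders, not taken from this thread:)

./bin/spark-submit --driver-memory 20g your_kmeans_script.py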
On Thu
Hi All,
I am trying to run KMeans clustering on a large data set with 12,000 points
and 80,000 dimensions. I have a Spark cluster on EC2 in standalone mode
with 8 workers running on 2 slaves with 160 GB RAM and 40 vCPUs.
My code is as follows:
def convert_into_sparse_vector(A):
non_nan_indic
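(The code is cut off in the archive. Below is a minimal sketch of what a conversion like this usually looks like with MLlib's sparse vectors; the NaN-filtering logic, the RDD name rows, and the KMeans parameters are guesses based on the fragment and the rest of the thread, not the original code:)

import numpy as np
from pyspark.mllib.linalg import Vectors
from pyspark.mllib.clustering import KMeans

def convert_into_sparse_vector(A):
    # Guess at the truncated body: keep only the non-NaN entries of the
    # dense 80,000-dimensional row A and pack them into a SparseVector.
    non_nan_indices = np.where(~np.isnan(A))[0]
    return Vectors.sparse(len(A), non_nan_indices, A[non_nan_indices])

# Hypothetical usage, assuming rows is an RDD of dense numpy arrays:
# sparse_data = rows.map(convert_into_sparse_vector)
# model = KMeans.train(sparse_data, k=1500, maxIterations=20)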