Re: Settings for K-Means Clustering in MLlib for large data set

2015-06-23 Thread Xiangrui Meng
A rough estimate of the worst-case memory requirement for the driver is about 2 * k * runs * numFeatures * numPartitions * 8 bytes. I put the 2 at the beginning because the previous centers are still in memory while the new center updates are being received. -Xiangrui
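
A quick way to sanity-check this bound (a sketch, not from the thread; the function name is illustrative, and runs=1 and numPartitions=10 are assumptions based on the KMeans default and the guess earlier in the thread):

    def kmeans_driver_worst_case_bytes(k, runs, num_features, num_partitions):
        # Factor of 2: the previous centers stay in memory while new updates arrive.
        return 2 * k * runs * num_features * num_partitions * 8

    # For the failing case in this thread (K=1500, 80,000 features):
    print(kmeans_driver_worst_case_bytes(1500, 1, 80000, 10) / 1e9)  # ~19.2 GB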

Re: Settings for K-Means Clustering in MLlib for large data set

2015-06-19 Thread Rogers Jeffrey
Thanks. Setting the driver memory property worked for K=1000, but when I increased K to 1500 I get the following error:

    15/06/19 09:38:44 INFO ContextCleaner: Cleaned accumulator 7
    15/06/19 09:38:44 INFO BlockManagerInfo: Removed broadcast_34_piece0 on 172.31.3.51:45157 in memory (size: 1568.0
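
A possible next step, not suggested in the thread itself: when launching, raise both the driver heap and the cap on results collected back to the driver. The flag and property below are standard Spark options; the sizes are only illustrative:

    pyspark --driver-memory 30g --conf spark.driver.maxResultSize=8g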

Re: Settings for K-Means Clustering in MLlib for large data set

2015-06-18 Thread Rogers Jeffrey
I am submitting the application from a Python notebook. I am launching pyspark as follows:

    SPARK_PUBLIC_DNS=ec2-54-165-202-17.compute-1.amazonaws.com \
    SPARK_WORKER_CORES=8 \
    SPARK_WORKER_MEMORY=15g \
    SPARK_MEM=30g \
    OUR_JAVA_MEM=30g \
    SPARK_DAEMON_JAVA_OPTS="-XX:MaxPermSize=30g -Xms30g -Xmx30g" \
    IPYTHON=1
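
An aside, not from the thread: -XX:MaxPermSize sizes the JVM permanent generation, not the heap, and the driver heap for a pyspark session is normally set with the --driver-memory flag rather than daemon options, e.g.:

    IPYTHON=1 pyspark --driver-memory 30g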

Re: Settings for K-Means Clustering in MLlib for large data set

2015-06-18 Thread Xiangrui Meng
With 80,000 features and 1000 clusters, you need 80,000,000 doubles to store the cluster centers. That is ~600MB. If there are 10 partitions, you might need 6GB on the driver to collect updates from workers. I guess the driver died. Did you specify driver memory with spark-submit? -Xiangrui
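
The same arithmetic written out as a sketch (numbers are the ones from this message; the variable names are illustrative):

    num_features = 80000
    k = 1000
    center_bytes = num_features * k * 8          # 640,000,000 bytes, i.e. ~600 MB of doubles
    num_partitions = 10
    driver_peak = center_bytes * num_partitions  # ~6.4 GB if each partition's update is collected
    print(center_bytes / 1e6, "MB;", driver_peak / 1e9, "GB")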

Settings for K-Means Clustering in MLlib for large data set

2015-06-18 Thread Rogers Jeffrey
Hi All, I am trying to run KMeans clustering on a large data set with 12,000 points and 80,000 dimensions. I have a Spark cluster in EC2 standalone mode with 8 workers running on 2 slaves with 160 GB RAM and 40 vCPUs. My code is as follows:

    def convert_into_sparse_vector(A): non_nan_indic
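
The helper is cut off in the archive; below is a minimal reconstruction, assuming it builds an MLlib SparseVector from the non-NaN entries of a dense row. Everything past the def line is my guess, not the original code:

    import numpy as np
    from pyspark.mllib.linalg import Vectors

    def convert_into_sparse_vector(A):
        A = np.asarray(A, dtype=float)
        non_nan_indices = np.where(~np.isnan(A))[0]   # keep only the observed entries
        return Vectors.sparse(len(A), non_nan_indices.tolist(),
                              A[non_nan_indices].tolist())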