Hi Xiangrui,

I will try this shortly. When using N partitions, do you recommend N be the number of cores on each slave or the number of cores on the master? Forgive my ignorance, but is this best achieved as an argument to sc.textFile?
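For example, would something like this be the right place to set it? (A rough sketch only; the path, the parsing, and the partition count here are placeholders rather than what I'm actually running.)

from pyspark import SparkContext

sc = SparkContext(appName="RecommendALS")
num_partitions = 8  # placeholder: e.g. the total number of CPU cores across the slaves

# Either ask textFile for a minimum number of partitions up front...
lines = sc.textFile("/path/to/ratings.dat", minPartitions=num_partitions)

# ...or repartition the parsed ratings explicitly before training.
ratings = lines.map(lambda l: l.split("::")) \
               .map(lambda p: (int(p[0]), int(p[1]), float(p[2]))) \
               .repartition(num_partitions)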
The slaves on the EC2 clusters start with only 8 GB of storage, and it doesn't seem that /mnt/spark and /mnt2/spark are mounted anywhere else by default. Looking at spark-ec2/setup-slaves.sh, it appears that these are only mounted if the instance type begins with r3. (Or am I not reading that right?) My slaves are a different instance type, and currently look like this:

Filesystem      Size  Used Avail Use% Mounted on
/dev/xvda1      7.9G  7.3G  515M  94% /
tmpfs           7.4G  4.0K  7.4G   1% /dev/shm
/dev/xvdv       500G  2.5G  498G   1% /vol

I have been able to finish ALS on MovieLens 10M only twice, taking 221s and 315s for 20 iterations at K=20 and lambda=0.01. Does that timing sound about right, or does it point to a poor configuration? The same script with MovieLens 1M runs fine in about 30-40s with the same settings. (In both cases I'm training on 70% of the data.)

Thanks for your help!
Chris


On Wed, Jul 16, 2014 at 4:29 PM, Xiangrui Meng <men...@gmail.com> wrote:
> For ALS, I would recommend repartitioning the ratings to match the
> number of CPU cores or even fewer. ALS is not computation heavy for
> small k but communication heavy. Having a small number of partitions
> may help. For EC2 clusters, we use /mnt/spark and /mnt2/spark as the
> default local directory because they are local hard drives. Did your
> last run of ALS on MovieLens 10M-100K with the default settings
> succeed? -Xiangrui
>
> On Wed, Jul 16, 2014 at 8:00 AM, Chris DuBois <chris.dub...@gmail.com> wrote:
> > Hi Xiangrui,
> >
> > I accidentally did not send df -i for the master node. Here it is at the
> > moment of failure:
> >
> > Filesystem       Inodes  IUsed     IFree IUse% Mounted on
> > /dev/xvda1       524288 280938    243350   54% /
> > tmpfs           3845409      1   3845408    1% /dev/shm
> > /dev/xvdb      10002432   1027  10001405    1% /mnt
> > /dev/xvdf      10002432     16  10002416    1% /mnt2
> > /dev/xvdv     524288000     13 524287987    1% /vol
> >
> > I am using the default settings now, but is there a way to make sure that
> > the proper directories are being used? How many blocks/partitions do you
> > recommend?
> >
> > Chris
> >
> >
> > On Wed, Jul 16, 2014 at 1:09 AM, Chris DuBois <chris.dub...@gmail.com> wrote:
> >>
> >> Hi Xiangrui,
> >>
> >> Here is the result on the master node:
> >> $ df -i
> >> Filesystem       Inodes  IUsed     IFree IUse% Mounted on
> >> /dev/xvda1       524288 273997    250291   53% /
> >> tmpfs           1917974      1   1917973    1% /dev/shm
> >> /dev/xvdv     524288000     30 524287970    1% /vol
> >>
> >> I have reproduced the error while using the MovieLens 10M data set on a
> >> newly created cluster.
> >>
> >> Thanks for the help.
> >> Chris
> >>
> >>
> >> On Wed, Jul 16, 2014 at 12:22 AM, Xiangrui Meng <men...@gmail.com> wrote:
> >>>
> >>> Hi Chris,
> >>>
> >>> Could you also try `df -i` on the master node? How many
> >>> blocks/partitions did you set?
> >>>
> >>> In the current implementation, ALS doesn't clean the shuffle data
> >>> because the operations are chained together. But it shouldn't run out
> >>> of disk space on the MovieLens dataset, which is small. The spark-ec2
> >>> script sets /mnt/spark and /mnt2/spark as spark.local.dir by default;
> >>> I would recommend leaving this setting at its default value.
> >>>
> >>> Best,
> >>> Xiangrui
> >>>
> >>> On Wed, Jul 16, 2014 at 12:02 AM, Chris DuBois <chris.dub...@gmail.com> wrote:
> >>> > Thanks for the quick responses!
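> >>> >
> >>> > (For context, the training step in my script closely follows the MLlib
> >>> > collaborative filtering example. A rough sketch, with made-up variable
> >>> > names rather than my exact code:
> >>> >
> >>> > from pyspark.mllib.recommendation import ALS
> >>> >
> >>> > training = ratings.sample(False, 0.7, seed=0)  # ~70% of the ratings
> >>> > model = ALS.train(training, rank=20, iterations=20, lambda_=0.01)
> >>> >
> >>> > where ratings is an RDD of (user, movie, rating) tuples.)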
> >>> >
> >>> > I used your final -Dspark.local.dir suggestion, but I see this during the
> >>> > initialization of the application:
> >>> >
> >>> > 14/07/16 06:56:08 INFO storage.DiskBlockManager: Created local directory at
> >>> > /vol/spark-local-20140716065608-7b2a
> >>> >
> >>> > I would have expected something in /mnt/spark/.
> >>> >
> >>> > Thanks,
> >>> > Chris
> >>> >
> >>> >
> >>> > On Tue, Jul 15, 2014 at 11:44 PM, Chris Gore <cdg...@cdgore.com> wrote:
> >>> >>
> >>> >> Hi Chris,
> >>> >>
> >>> >> I've encountered this error when running Spark's ALS methods too. In my
> >>> >> case, it was because I set spark.local.dir improperly, and every time
> >>> >> there was a shuffle, it would spill many GB of data onto the local drive.
> >>> >> What fixed it was setting it to use the /mnt directory, where a network
> >>> >> drive is mounted. For example, set an environment variable:
> >>> >>
> >>> >> export SPACE=$(mount | grep mnt | awk '{print $3"/spark/"}' | xargs | sed 's/ /,/g')
> >>> >>
> >>> >> Then add -Dspark.local.dir=$SPACE or simply
> >>> >> -Dspark.local.dir=/mnt/spark/,/mnt2/spark/ when you run your driver
> >>> >> application.
> >>> >>
> >>> >> Chris
> >>> >>
> >>> >> On Jul 15, 2014, at 11:39 PM, Xiangrui Meng <men...@gmail.com> wrote:
> >>> >>
> >>> >> > Check the number of inodes (df -i). The assembly build may create many
> >>> >> > small files. -Xiangrui
> >>> >> >
> >>> >> > On Tue, Jul 15, 2014 at 11:35 PM, Chris DuBois <chris.dub...@gmail.com> wrote:
> >>> >> >> Hi all,
> >>> >> >>
> >>> >> >> I am encountering the following error:
> >>> >> >>
> >>> >> >> INFO scheduler.TaskSetManager: Loss was due to java.io.IOException:
> >>> >> >> No space left on device [duplicate 4]
> >>> >> >>
> >>> >> >> For each slave, df -h looks roughly like this, which makes the above
> >>> >> >> error surprising.
> >>> >> >>
> >>> >> >> Filesystem      Size  Used Avail Use% Mounted on
> >>> >> >> /dev/xvda1      7.9G  4.4G  3.5G  57% /
> >>> >> >> tmpfs           7.4G  4.0K  7.4G   1% /dev/shm
> >>> >> >> /dev/xvdb        37G  3.3G   32G  10% /mnt
> >>> >> >> /dev/xvdf        37G  2.0G   34G   6% /mnt2
> >>> >> >> /dev/xvdv       500G   33M  500G   1% /vol
> >>> >> >>
> >>> >> >> I'm on an EC2 cluster (c3.xlarge + 5 x m3) that I launched using the
> >>> >> >> spark-ec2 scripts and a clone of Spark from today. The job I am
> >>> >> >> running closely resembles the collaborative filtering example. This
> >>> >> >> issue happens with the 1M version as well as the 10 million rating
> >>> >> >> version of the MovieLens dataset.
> >>> >> >>
> >>> >> >> I have seen previous questions, but they haven't helped yet. For
> >>> >> >> example, I tried setting the Spark tmp directory to the EBS volume at
> >>> >> >> /vol/, both by editing the Spark conf file (and copy-dir'ing it to the
> >>> >> >> slaves) as well as through the SparkConf. Yet I still get the above
> >>> >> >> error. Here is my current Spark config; note that I'm launching via
> >>> >> >> ~/spark/bin/spark-submit.
> >>> >> >>
> >>> >> >> conf = SparkConf()
> >>> >> >> conf.setAppName("RecommendALS") \
> >>> >> >>     .set("spark.local.dir", "/vol/") \
> >>> >> >>     .set("spark.executor.memory", "7g") \
> >>> >> >>     .set("spark.akka.frameSize", "100") \
> >>> >> >>     .setExecutorEnv("SPARK_JAVA_OPTS", " -Dspark.akka.frameSize=100")
> >>> >> >> sc = SparkContext(conf=conf)
> >>> >> >>
> >>> >> >> Thanks for any advice,
> >>> >> >> Chris
> >>> >> >>
> >>> >>
> >>> >
> >>
> >
>
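P.S. To be concrete about what I'll try next: here is a minimal sketch of the same driver config, but with spark.local.dir pointed at the instance-store mounts (per Chris Gore's suggestion and the spark-ec2 defaults) instead of /vol/. I'm assuming /mnt/spark and /mnt2/spark exist on every slave; the memory setting is just carried over from above.

from pyspark import SparkConf, SparkContext

# Use the local instance-store directories for shuffle spill instead of the root/EBS volume.
conf = SparkConf() \
    .setAppName("RecommendALS") \
    .set("spark.local.dir", "/mnt/spark,/mnt2/spark") \
    .set("spark.executor.memory", "7g")
sc = SparkContext(conf=conf)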