For ALS, I would recommend repartitioning the ratings to match the number of CPU cores, or even fewer. ALS is not computation-heavy for small k, but it is communication-heavy, so having a small number of partitions may help. For EC2 clusters, we use /mnt/spark and /mnt2/spark as the default local directories because they are local hard drives. Did your last run of ALS on MovieLens 10M-100K with the default settings succeed?

-Xiangrui
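A rough sketch of that repartitioning step in PySpark, modeled on the MLlib collaborative filtering example (the ratings path, the partition count of 20, and the ALS parameters are placeholders; pick values that match your cluster):

from pyspark import SparkConf, SparkContext
from pyspark.mllib.recommendation import ALS

conf = SparkConf().setAppName("RecommendALS")
sc = SparkContext(conf=conf)

# Placeholder path; MovieLens 10M lines look like "user::movie::rating::timestamp".
lines = sc.textFile("/vol/ml-10M100K/ratings.dat")
ratings = lines.map(lambda l: l.split("::")) \
               .map(lambda p: (int(p[0]), int(p[1]), float(p[2])))

# Repartition to roughly the number of CPU cores in the cluster, or fewer.
# ALS with a small rank is communication-heavy rather than compute-heavy,
# so a few large partitions tend to beat many small ones.
ratings = ratings.repartition(20)

model = ALS.train(ratings, 10, 10)  # rank=10, iterations=10; both placeholders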
On Wed, Jul 16, 2014 at 8:00 AM, Chris DuBois <chris.dub...@gmail.com> wrote:
> Hi Xiangrui,
>
> I accidentally did not send df -i for the master node. Here it is at the moment of failure:
>
> Filesystem      Inodes    IUsed     IFree IUse% Mounted on
> /dev/xvda1      524288   280938    243350   54% /
> tmpfs          3845409        1   3845408    1% /dev/shm
> /dev/xvdb     10002432     1027  10001405    1% /mnt
> /dev/xvdf     10002432       16  10002416    1% /mnt2
> /dev/xvdv    524288000       13 524287987    1% /vol
>
> I am using default settings now, but is there a way to make sure that the proper directories are being used? How many blocks/partitions do you recommend?
>
> Chris
>
>
> On Wed, Jul 16, 2014 at 1:09 AM, Chris DuBois <chris.dub...@gmail.com> wrote:
>>
>> Hi Xiangrui,
>>
>> Here is the result on the master node:
>> $ df -i
>> Filesystem      Inodes  IUsed     IFree IUse% Mounted on
>> /dev/xvda1      524288 273997    250291   53% /
>> tmpfs          1917974      1   1917973    1% /dev/shm
>> /dev/xvdv    524288000     30 524287970    1% /vol
>>
>> I have reproduced the error while using the MovieLens 10M data set on a newly created cluster.
>>
>> Thanks for the help.
>> Chris
>>
>>
>> On Wed, Jul 16, 2014 at 12:22 AM, Xiangrui Meng <men...@gmail.com> wrote:
>>>
>>> Hi Chris,
>>>
>>> Could you also try `df -i` on the master node? How many blocks/partitions did you set?
>>>
>>> In the current implementation, ALS doesn't clean the shuffle data because the operations are chained together. But it shouldn't run out of disk space on the MovieLens dataset, which is small. The spark-ec2 script sets /mnt/spark and /mnt2/spark as the local.dir by default; I would recommend leaving this setting at its default value.
>>>
>>> Best,
>>> Xiangrui
>>>
>>> On Wed, Jul 16, 2014 at 12:02 AM, Chris DuBois <chris.dub...@gmail.com> wrote:
>>> > Thanks for the quick responses!
>>> >
>>> > I used your final -Dspark.local.dir suggestion, but I see this during the initialization of the application:
>>> >
>>> > 14/07/16 06:56:08 INFO storage.DiskBlockManager: Created local directory at /vol/spark-local-20140716065608-7b2a
>>> >
>>> > I would have expected something in /mnt/spark/.
>>> >
>>> > Thanks,
>>> > Chris
>>> >
>>> >
>>> >
>>> > On Tue, Jul 15, 2014 at 11:44 PM, Chris Gore <cdg...@cdgore.com> wrote:
>>> >>
>>> >> Hi Chris,
>>> >>
>>> >> I've encountered this error when running Spark's ALS methods too. In my case, it was because I set spark.local.dir improperly, and every time there was a shuffle, it would spill many GB of data onto the local drive. What fixed it was setting it to use the /mnt directory, where a network drive is mounted. For example, setting an environment variable:
>>> >>
>>> >> export SPACE=$(mount | grep mnt | awk '{print $3"/spark/"}' | xargs | sed 's/ /,/g')
>>> >>
>>> >> Then adding -Dspark.local.dir=$SPACE or simply -Dspark.local.dir=/mnt/spark/,/mnt2/spark/ when you run your driver application.
>>> >>
>>> >> Chris
>>> >>
>>> >> On Jul 15, 2014, at 11:39 PM, Xiangrui Meng <men...@gmail.com> wrote:
>>> >>
>>> >> > Check the number of inodes (df -i). The assembly build may create many small files.
>>> >> > -Xiangrui
>>> >> >
>>> >> > On Tue, Jul 15, 2014 at 11:35 PM, Chris DuBois <chris.dub...@gmail.com> wrote:
>>> >> >> Hi all,
>>> >> >>
>>> >> >> I am encountering the following error:
>>> >> >>
>>> >> >> INFO scheduler.TaskSetManager: Loss was due to java.io.IOException: No space left on device [duplicate 4]
>>> >> >>
>>> >> >> For each slave, df -h looks roughly like this, which makes the above error surprising.
>>> >> >>
>>> >> >> Filesystem      Size  Used Avail Use% Mounted on
>>> >> >> /dev/xvda1      7.9G  4.4G  3.5G  57% /
>>> >> >> tmpfs           7.4G  4.0K  7.4G   1% /dev/shm
>>> >> >> /dev/xvdb        37G  3.3G   32G  10% /mnt
>>> >> >> /dev/xvdf        37G  2.0G   34G   6% /mnt2
>>> >> >> /dev/xvdv       500G   33M  500G   1% /vol
>>> >> >>
>>> >> >> I'm on an EC2 cluster (c3.xlarge + 5 x m3) that I launched using the spark-ec2 scripts and a clone of spark from today. The job I am running closely resembles the collaborative filtering example. This issue happens with the 1M version as well as the 10 million rating version of the MovieLens dataset.
>>> >> >>
>>> >> >> I have seen previous questions, but they haven't helped yet. For example, I tried setting the Spark tmp directory to the EBS volume at /vol/, both by editing the spark conf file (and copy-dir'ing it to the slaves) and through the SparkConf. Yet I still get the above error. Here is my current Spark config; note that I'm launching via ~/spark/bin/spark-submit.
>>> >> >>
>>> >> >> conf = SparkConf()
>>> >> >> conf.setAppName("RecommendALS") \
>>> >> >>     .set("spark.local.dir", "/vol/") \
>>> >> >>     .set("spark.executor.memory", "7g") \
>>> >> >>     .set("spark.akka.frameSize", "100") \
>>> >> >>     .setExecutorEnv("SPARK_JAVA_OPTS", " -Dspark.akka.frameSize=100")
>>> >> >> sc = SparkContext(conf=conf)
>>> >> >>
>>> >> >> Thanks for any advice,
>>> >> >> Chris
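A minimal sketch of the spark.local.dir suggestion from earlier in the thread: point the shuffle space at the instance-store mounts that spark-ec2 configures (/mnt/spark and /mnt2/spark) instead of /vol. The memory and frame-size values are copied from the original config; whether a driver-side setting reaches the executors can depend on how SPARK_LOCAL_DIRS is set in spark-env.sh on the workers, so treat this as a starting point rather than a guaranteed fix:

from pyspark import SparkConf, SparkContext

# Use the instance-store directories that spark-ec2 sets up by default
# (per the messages above), rather than the EBS volume at /vol.
conf = (SparkConf()
        .setAppName("RecommendALS")
        .set("spark.local.dir", "/mnt/spark,/mnt2/spark")
        .set("spark.executor.memory", "7g")
        .set("spark.akka.frameSize", "100"))
sc = SparkContext(conf=conf)

Either way, the storage.DiskBlockManager INFO line at startup shows which directory was actually chosen, which is a quick way to confirm the setting took effect.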