Hi Xiangrui,

I will try this shortly. When using N partitions, do you recommend N be the number of cores on each slave or the number of cores on the master? Forgive my ignorance, but is this best achieved as an argument to sc.textFile?
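For example, would something like this be the right place to set it? (A rough sketch only; the path, the parsing, and the partition count here are placeholders rather than what I'm actually running.)

from pyspark import SparkContext

sc = SparkContext(appName="RecommendALS")
num_partitions = 8  # placeholder: e.g. the total number of CPU cores across the slaves

# Either ask textFile for a minimum number of partitions up front...
lines = sc.textFile("/path/to/ratings.dat", minPartitions=num_partitions)

# ...or repartition the parsed ratings explicitly before training.
ratings = lines.map(lambda l: l.split("::")) \
               .map(lambda p: (int(p[0]), int(p[1]), float(p[2]))) \
               .repartition(num_partitions)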
The slaves on the EC2 clusters start with only 8 GB of storage, and it doesn't seem that /mnt/spark and /mnt2/spark are mounted anywhere else by default. Looking at spark-ec2/setup-slaves.sh, it appears that these are only mounted if the instance type begins with r3. (Or am I not reading that right?) My slaves are a different instance type, and currently look like this:

Filesystem      Size  Used Avail Use% Mounted on
/dev/xvda1      7.9G  7.3G  515M  94% /
tmpfs           7.4G  4.0K  7.4G   1% /dev/shm
/dev/xvdv       500G  2.5G  498G   1% /vol

I have been able to finish ALS on MovieLens 10M only twice, taking 221s and 315s for 20 iterations at K=20 and lambda=0.01. Does that timing sound about right, or does it point to a poor configuration? The same script with MovieLens 1M runs fine in about 30-40s with the same settings. (In both cases I'm training on 70% of the data.)

Thanks for your help!
Chris


On Wed, Jul 16, 2014 at 4:29 PM, Xiangrui Meng <men...@gmail.com> wrote:
> For ALS, I would recommend repartitioning the ratings to match the
> number of CPU cores or even fewer. ALS is not computation heavy for
> small k but communication heavy. Having a small number of partitions
> may help. For EC2 clusters, we use /mnt/spark and /mnt2/spark as the
> default local directory because they are local hard drives. Did your
> last run of ALS on MovieLens 10M-100K with the default settings
> succeed? -Xiangrui
>
> On Wed, Jul 16, 2014 at 8:00 AM, Chris DuBois <chris.dub...@gmail.com> wrote:
> > Hi Xiangrui,
> >
> > I accidentally did not send df -i for the master node. Here it is at the
> > moment of failure:
> >
> > Filesystem       Inodes  IUsed     IFree IUse% Mounted on
> > /dev/xvda1       524288 280938    243350   54% /
> > tmpfs           3845409      1   3845408    1% /dev/shm
> > /dev/xvdb      10002432   1027  10001405    1% /mnt
> > /dev/xvdf      10002432     16  10002416    1% /mnt2
> > /dev/xvdv     524288000     13 524287987    1% /vol
> >
> > I am using the default settings now, but is there a way to make sure that
> > the proper directories are being used? How many blocks/partitions do you
> > recommend?
> >
> > Chris
> >
> >
> > On Wed, Jul 16, 2014 at 1:09 AM, Chris DuBois <chris.dub...@gmail.com> wrote:
> >>
> >> Hi Xiangrui,
> >>
> >> Here is the result on the master node:
> >> $ df -i
> >> Filesystem       Inodes  IUsed     IFree IUse% Mounted on
> >> /dev/xvda1       524288 273997    250291   53% /
> >> tmpfs           1917974      1   1917973    1% /dev/shm
> >> /dev/xvdv     524288000     30 524287970    1% /vol
> >>
> >> I have reproduced the error while using the MovieLens 10M data set on a
> >> newly created cluster.
> >>
> >> Thanks for the help.
> >> Chris
> >>
> >>
> >> On Wed, Jul 16, 2014 at 12:22 AM, Xiangrui Meng <men...@gmail.com> wrote:
> >>>
> >>> Hi Chris,
> >>>
> >>> Could you also try `df -i` on the master node? How many
> >>> blocks/partitions did you set?
> >>>
> >>> In the current implementation, ALS doesn't clean the shuffle data
> >>> because the operations are chained together. But it shouldn't run out
> >>> of disk space on the MovieLens dataset, which is small. The spark-ec2
> >>> script sets /mnt/spark and /mnt2/spark as spark.local.dir by default;
> >>> I would recommend leaving this setting at its default value.
> >>>
> >>> Best,
> >>> Xiangrui
> >>>
> >>> On Wed, Jul 16, 2014 at 12:02 AM, Chris DuBois <chris.dub...@gmail.com> wrote:
> >>> > Thanks for the quick responses!
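> >>> >
> >>> > (For context, the training step in my script closely follows the MLlib
> >>> > collaborative filtering example. A rough sketch, with made-up variable
> >>> > names rather than my exact code:
> >>> >
> >>> > from pyspark.mllib.recommendation import ALS
> >>> >
> >>> > training = ratings.sample(False, 0.7, seed=0)  # ~70% of the ratings
> >>> > model = ALS.train(training, rank=20, iterations=20, lambda_=0.01)
> >>> >
> >>> > where ratings is an RDD of (user, movie, rating) tuples.)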
> >>> >
> >>> > I used your final -Dspark.local.dir suggestion, but I see this during the
> >>> > initialization of the application:
> >>> >
> >>> > 14/07/16 06:56:08 INFO storage.DiskBlockManager: Created local directory at
> >>> > /vol/spark-local-20140716065608-7b2a
> >>> >
> >>> > I would have expected something in /mnt/spark/.
> >>> >
> >>> > Thanks,
> >>> > Chris
> >>> >
> >>> >
> >>> > On Tue, Jul 15, 2014 at 11:44 PM, Chris Gore <cdg...@cdgore.com> wrote:
> >>> >>
> >>> >> Hi Chris,
> >>> >>
> >>> >> I've encountered this error when running Spark's ALS methods too. In my
> >>> >> case, it was because I set spark.local.dir improperly, and every time
> >>> >> there was a shuffle, it would spill many GB of data onto the local drive.
> >>> >> What fixed it was setting it to use the /mnt directory, where a network
> >>> >> drive is mounted. For example, set an environment variable:
> >>> >>
> >>> >> export SPACE=$(mount | grep mnt | awk '{print $3"/spark/"}' | xargs | sed 's/ /,/g')
> >>> >>
> >>> >> Then add -Dspark.local.dir=$SPACE or simply
> >>> >> -Dspark.local.dir=/mnt/spark/,/mnt2/spark/ when you run your driver
> >>> >> application.
> >>> >>
> >>> >> Chris
> >>> >>
> >>> >> On Jul 15, 2014, at 11:39 PM, Xiangrui Meng <men...@gmail.com> wrote:
> >>> >>
> >>> >> > Check the number of inodes (df -i). The assembly build may create many
> >>> >> > small files. -Xiangrui
> >>> >> >
> >>> >> > On Tue, Jul 15, 2014 at 11:35 PM, Chris DuBois <chris.dub...@gmail.com> wrote:
> >>> >> >> Hi all,
> >>> >> >>
> >>> >> >> I am encountering the following error:
> >>> >> >>
> >>> >> >> INFO scheduler.TaskSetManager: Loss was due to java.io.IOException:
> >>> >> >> No space left on device [duplicate 4]
> >>> >> >>
> >>> >> >> For each slave, df -h looks roughly like this, which makes the above
> >>> >> >> error surprising.
> >>> >> >>
> >>> >> >> Filesystem      Size  Used Avail Use% Mounted on
> >>> >> >> /dev/xvda1      7.9G  4.4G  3.5G  57% /
> >>> >> >> tmpfs           7.4G  4.0K  7.4G   1% /dev/shm
> >>> >> >> /dev/xvdb        37G  3.3G   32G  10% /mnt
> >>> >> >> /dev/xvdf        37G  2.0G   34G   6% /mnt2
> >>> >> >> /dev/xvdv       500G   33M  500G   1% /vol
> >>> >> >>
> >>> >> >> I'm on an EC2 cluster (c3.xlarge + 5 x m3) that I launched using the
> >>> >> >> spark-ec2 scripts and a clone of Spark from today. The job I am
> >>> >> >> running closely resembles the collaborative filtering example. This
> >>> >> >> issue happens with the 1M version as well as the 10 million rating
> >>> >> >> version of the MovieLens dataset.
> >>> >> >>
> >>> >> >> I have seen previous questions, but they haven't helped yet. For
> >>> >> >> example, I tried setting the Spark tmp directory to the EBS volume at
> >>> >> >> /vol/, both by editing the Spark conf file (and copy-dir'ing it to the
> >>> >> >> slaves) as well as through the SparkConf. Yet I still get the above
> >>> >> >> error. Here is my current Spark config; note that I'm launching via
> >>> >> >> ~/spark/bin/spark-submit.
> >>> >> >>
> >>> >> >> conf = SparkConf()
> >>> >> >> conf.setAppName("RecommendALS") \
> >>> >> >>     .set("spark.local.dir", "/vol/") \
> >>> >> >>     .set("spark.executor.memory", "7g") \
> >>> >> >>     .set("spark.akka.frameSize", "100") \
> >>> >> >>     .setExecutorEnv("SPARK_JAVA_OPTS", " -Dspark.akka.frameSize=100")
> >>> >> >> sc = SparkContext(conf=conf)
> >>> >> >>
> >>> >> >> Thanks for any advice,
> >>> >> >> Chris
> >>> >> >>
> >>> >>
> >>> >
> >>
> >
>
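P.S. To be concrete about what I'll try next: here is a minimal sketch of the same driver config, but with spark.local.dir pointed at the instance-store mounts (per Chris Gore's suggestion and the spark-ec2 defaults) instead of /vol/. I'm assuming /mnt/spark and /mnt2/spark exist on every slave; the memory setting is just carried over from above.

from pyspark import SparkConf, SparkContext

# Use the local instance-store directories for shuffle spill instead of the root/EBS volume.
conf = SparkConf() \
    .setAppName("RecommendALS") \
    .set("spark.local.dir", "/mnt/spark,/mnt2/spark") \
    .set("spark.executor.memory", "7g")
sc = SparkContext(conf=conf)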