Ognen - just so I understand: the issue is that there weren't enough inodes, and this was causing a "No space left on device" error? Is that correct? If so, that's good to know, because it's definitely counterintuitive.
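For reference, a filesystem can return "No space left on device" even when plenty of bytes are free, because it has run out of inodes. A quick way to tell the two cases apart (the /tmp path here is just the default location discussed in this thread):

    df -h /tmp    # block/byte usage - can show gigabytes free even while the error occurs
    df -i /tmp    # inode usage - 100% under IUse% means no new files can be created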
On Sun, Mar 23, 2014 at 8:36 PM, Ognen Duzlevski <og...@nengoiksvelzud.com> wrote:
> I would love to work on this (and other) stuff if I can bother someone with
> questions offline or on a dev mailing list.
> Ognen
>
>
> On 3/23/14, 10:04 PM, Aaron Davidson wrote:
>
> Thanks for bringing this up. 100% inode utilization is an issue I haven't
> seen raised before, and it raises another issue which is not on our current
> roadmap for state cleanup (cleaning up data which was not fully cleaned up
> from a crashed process).
>
>
> On Sun, Mar 23, 2014 at 7:57 PM, Ognen Duzlevski
> <og...@plainvanillagames.com> wrote:
>>
>> Bleh, strike that, one of my slaves was at 100% inode utilization on the
>> file system. It was /tmp/spark* leftovers that apparently did not get
>> cleaned up properly after failed or interrupted jobs.
>> Mental note - run a cron job on all slaves and master to clean up
>> /tmp/spark* regularly.
>>
>> Thanks (and sorry for the noise)!
>> Ognen
>>
>>
>> On 3/23/14, 9:52 PM, Ognen Duzlevski wrote:
>>
>> Aaron, thanks for replying. I am very much puzzled as to what is going on.
>> A job that used to run on the same cluster is failing with this mysterious
>> message about not having enough disk space, when in fact I can see through
>> "watch df -h" that the free space is always hovering around 3+ GB on the
>> disk and the free inodes are at 50% (this is on the master). I went through
>> each slave's spark/work/app*/stderr and stdout and spark/logs/*out files,
>> and there is no mention of "too many open files" failures on any of the
>> slaves nor on the master :(
>>
>> Thanks
>> Ognen
>>
>> On 3/23/14, 8:38 PM, Aaron Davidson wrote:
>>
>> By default, with P partitions (for both the pre-shuffle stage and
>> post-shuffle), there are P^2 files created. With
>> spark.shuffle.consolidateFiles turned on, we would instead create only P
>> files. Disk space consumption, however, is largely unaffected by the number
>> of partitions, unless each partition is particularly small.
>>
>> You might look at the actual executors' logs, as it's possible that this
>> error was caused by an earlier exception, such as "too many open files".
>>
>>
>> On Sun, Mar 23, 2014 at 4:46 PM, Ognen Duzlevski
>> <og...@plainvanillagames.com> wrote:
>>>
>>> On 3/23/14, 5:49 PM, Matei Zaharia wrote:
>>>
>>> You can set spark.local.dir to put this data somewhere other than /tmp
>>> if /tmp is full. Actually, it's recommended to have multiple local disks
>>> and set it to a comma-separated list of directories, one per disk.
>>>
>>> Matei, does the number of tasks/partitions in a transformation influence
>>> anything in terms of disk space consumption? Or inode consumption?
>>>
>>> Thanks,
>>> Ognen
>>
>>
>>
>> --
>> "A distributed system is one in which the failure of a computer you didn't
>> even know existed can render your own computer unusable"
>> -- Leslie Lamport
>
>
>
> --
> "No matter what they ever do to us, we must always act for the love of our
> people and the earth. We must not react out of hatred against those who have
> no sense."
> -- John Trudell
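Following up on the cron-job cleanup Ognen mentions above, here is a minimal sketch of what such a crontab entry could look like. The /tmp path matches the default spark.local.dir discussed in the thread; the daily schedule and the two-day age threshold are assumptions, and the threshold should be longer than your longest-running job so scratch directories still in use are not removed:

    # crontab entry: every day at 04:00, remove leftover /tmp/spark* directories
    # that have not been modified for more than 2 days (threshold is an assumption)
    0 4 * * * find /tmp -maxdepth 1 -name 'spark*' -type d -mtime +2 -exec rm -rf {} +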