Hey Brian,

Great points. Agree that federating a set of file systems via symlinks doesn't solve the general problem of scaling a namespace. I imagine GFS's "Name Spaces" were mostly useful for systems that grew w/o much need for rebalancing, e.g. log storage.

Thanks,
Eli

On Mon, Mar 1, 2010 at 6:31 PM, Brian Bockelman <bbock...@cse.unl.edu> wrote:
>
> Hey Eli,
>
> From past experience, static, manual namespace partitioning can really get
> you in trouble - you have to manually keep things balanced.
>
> The following things can go wrong:
>
> 1) One of your pesky users grows unexpectedly by a factor of 10.
> 2) Your entire system grew so much that there's not enough excess capacity to
> split and balance the cluster into new pieces - the extra bandwidth required
> would drive down production performance too much (or you need downtime to do
> it and can't afford the downtime).
> 3) Your production system began as a proof of concept, and your file name
> system makes it hard to split in a sane manner because you never planned on
> splitting the proof of concept in the first place!
>
> Any one of these can be solved with enough effort, but it can require a huge
> amount of effort if you don't realize things soon enough! In fact, I seem to
> remember an ACM Queue article with the original Google authors who cited
> explosive application growth as one reason that manual balancing quickly fell
> out of favor.
>
> I wouldn't deny that symlinks are an incredible tool to fight namespace
> growth - but it's not a 100% solution.
>
> That said, I'm looking forward to symlinks to solve a few local problems!
>
> Brian
>
> On Mar 1, 2010, at 8:15 PM, Eli Collins wrote:
>
>> On Mon, Mar 1, 2010 at 5:42 PM, Ketan Dixit <ketan.di...@gmail.com> wrote:
>>> Hello,
>>> Thank you Konstantin and Allen for your reply. The information
>>> provided really helped to improve my understanding.
>>> However I still have a few questions.
>>> How are symlinks/soft links used to solve the problem of partitioning?
>>> (Where do the symlinks point to? All the mapping is
>>> stored in memory but symlinks point to file objects? This is a little
>>> confusing to me.)
>>> Can you please provide insight into this?
>>
>> The idea is to use symlinks to present a single namespace to clients
>> that is backed by multiple file systems (hdfs or other supported
>> hadoop file systems). Eg a "root" HDFS file system could contain links
>> to other file systems, eg /dir1 could point to S3, /dir2 could point
>> to a local file system, /dir3 could point to another HDFS file system,
>> etc. Clients always contact the "root" HDFS file system but are
>> transparently redirected to other file systems by symlinks. This way a
>> single namespace is partitioned across multiple file systems, but the
>> client only needs to know about the root file system. This
>> partitioning is static (you have to establish the symlinks), though
>> you can grow on the fly by adding file systems and links that point to
>> them.
>>
>> Thanks,
>> Eli
>
>
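
To make the redirection described above a bit more concrete, here is a rough sketch in plain Python. It is not Hadoop code and does not use any Hadoop API; it only models the lookup a client would effectively perform. The URIs (`hdfs://root-nn:8020`, `s3://bucket1`, and so on) and the names `ROOT_LINKS` and `resolve` are made up for illustration.

```python
# Hypothetical model of a "root" namespace whose top-level entries are
# symlinks to other file systems. All URIs and names here are assumptions
# made for illustration; nothing below talks to a real file system.

ROOT_FS = "hdfs://root-nn:8020"

# Static partitioning: someone has to establish these links by hand.
ROOT_LINKS = {
    "/dir1": "s3://bucket1",      # /dir1 lives in S3
    "/dir2": "file:///data",      # /dir2 lives on a local file system
    "/dir3": "hdfs://nn2:8020",   # /dir3 lives in another HDFS instance
}


def resolve(path, links=ROOT_LINKS, root=ROOT_FS):
    """Return the URI of the file system location that serves `path`.

    The client only ever names paths in the root namespace; a matching
    link prefix redirects the remainder of the path to the target.
    """
    for prefix, target in links.items():
        # Match whole path components only, so /dir10 does not hit /dir1.
        if path == prefix or path.startswith(prefix + "/"):
            return target + path[len(prefix):]
    # No link matched: the path is served by the root file system itself.
    return root + path


if __name__ == "__main__":
    for p in ("/dir1/logs/a.txt", "/dir2/x", "/dir3", "/tmp/y"):
        print(p, "->", resolve(p))
```

Growing "on the fly" corresponds to adding one more entry to the table (for example a new `/dir4` link to a newly added file system). Nothing in this scheme moves existing data between the backing file systems, which is the manual rebalancing burden Brian describes.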