What is your HDFS replication set to? (For a quick way to check, see the commands below the quoted message.)

On Wed, Nov 25, 2015 at 1:31 AM, AlexG <swift...@gmail.com> wrote:
> I downloaded a 3.8 TB dataset from S3 to a freshly launched spark-ec2 cluster
> with 16.73 TB of storage, using distcp. The dataset is a collection of tar
> files of about 1.7 TB each. Nothing else was stored in the HDFS, but after
> completing the download, the namenode page says that 11.59 TB are in use.
> When I use hdfs dfs -du -s -h, I see that the dataset only takes up 3.8 TB as
> expected. I navigated through the entire HDFS hierarchy from /, and don't see
> where the missing space is. Any ideas what is going on and how to rectify it?
>
> I'm using the spark-ec2 script to launch, with the command
>
> spark-ec2 -k key -i ~/.ssh/key.pem -s 29 --instance-type=r3.8xlarge
> --placement-group=pcavariants --copy-aws-credentials
> --hadoop-major-version=yarn --spot-price=2.8 --region=us-west-2 launch
> conversioncluster
>
> and am not modifying any configuration files for Hadoop.
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Why-does-a-3-8-T-dataset-take-up-11-59-Tb-on-HDFS-tp25471.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
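For reference, assuming spark-ec2 left HDFS at the stock Hadoop default of dfs.replication=3: 3 x 3.8 TB is about 11.4 TB of raw block storage, which is close to the 11.59 TB the namenode reports, while hdfs dfs -du shows the logical (pre-replication) size, which is why it still says 3.8 TB. A rough sketch of commands to confirm this and, if desired, lower the replication of data that is already written (the /data path below is just a placeholder for wherever the dataset landed):

    # Default replication factor applied to newly written files
    hdfs getconf -confKey dfs.replication

    # Replication factor of an existing file (path is hypothetical)
    hdfs dfs -stat "%r" /data/chunk-0.tar

    # Cluster-wide raw usage, which counts every replica
    hdfs dfsadmin -report

    # Reduce replication on the existing dataset and wait for it to take effect
    hdfs dfs -setrep -w 2 /data

Lowering replication below 3 trades away fault tolerance and data locality for capacity, so it is usually only worth doing for data that can be re-downloaded from S3 if a node is lost.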