Hi Tom,

If you want to load the files directly in Spark, you can use sc.textFile("s3n://big-data-benchmark/pavlo/..."), where sc is your SparkContext. This works because the files should be publicly readable, so you don't need AWS credentials to access them.
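For example, in spark-shell (a minimal sketch; the suffix "text/1node/rankings" is only an assumption here, so substitute whatever path you actually find in the bucket listing):

    // sc is the SparkContext that spark-shell creates for you
    val rankings = sc.textFile("s3n://big-data-benchmark/pavlo/text/1node/rankings")
    println(rankings.count())  // sanity check: the data is readable without credentials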
If you want to download the files to your local drive, you can use the link http://s3.amazonaws.com/big-data-benchmark/pavlo/... (a sketch of scripting this is below the quoted message). One note, though: the tiny dataset doesn't seem to exist anymore. You can look at http://s3.amazonaws.com/big-data-benchmark/ to see the available files; a Ctrl+F for "tiny" in that listing returns zero matches.

Best,
Burak

----- Original Message -----
From: "Tom" <thubregt...@gmail.com>
To: u...@spark.incubator.apache.org
Sent: Tuesday, July 15, 2014 2:10:15 PM
Subject: Retrieve dataset of Big Data Benchmark

Hi,

I would like to use the dataset from the Big Data Benchmark <https://amplab.cs.berkeley.edu/benchmark/> on my own cluster, to run some comparisons between Hadoop and Spark. The dataset should be available at s3n://big-data-benchmark/pavlo/[text|text-deflate|sequence|sequence-snappy]/[suffix] in the Amazon cluster. Is there a way I can download this without being a user of the Amazon cluster? I tried "bin/hadoop distcp s3n://123:456@big-data-benchmark/pavlo/text/tiny/* ./", but it asks for an AWS Access Key ID and Secret Access Key, which I do not have.

Thanks in advance,
Tom
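Here is the HTTP download sketch mentioned above (Scala, standard library only; the object key is an assumption -- S3 "directories" are just key prefixes, so you fetch individual part files, and the real keys come from the bucket listing):

    import java.io.File
    import java.net.URL
    import scala.sys.process._

    // Stream one object from the public bucket straight to a local file.
    val url = new URL("http://s3.amazonaws.com/big-data-benchmark/pavlo/text/1node/rankings/part-00000")
    val exit = (url #> new File("part-00000")).!  // exit code 0 on success
    println(s"download exit code: $exit")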