Hi Tom,

If you wish to load the file in Spark directly, you can use
sc.textFile("s3n://big-data-benchmark/pavlo/...") where sc is your
SparkContext. This works because the files are publicly available, so you
don't need AWS credentials to access them.

If you want to download the files to your local drive, you can use the link
http://s3.amazonaws.com/big-data-benchmark/pavlo/...
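The two forms above are the same object addressed two ways: the s3n:// path
is just bucket + key, and the public HTTP link is the same bucket + key on
the s3.amazonaws.com endpoint. A minimal Python sketch of that mapping (the
helper name is my own, and the "..." stands in for whichever file path you
want, as in the links above):

```python
def s3n_to_http(s3n_path):
    """Map an s3n://bucket/key path to the public path-style S3 HTTP URL.

    Hypothetical helper for illustration: it only rewrites the URL, and
    downloading will only work if the bucket allows anonymous reads, as
    the big-data-benchmark bucket does.
    """
    prefix = "s3n://"
    if not s3n_path.startswith(prefix):
        raise ValueError("expected an s3n:// path")
    # Everything after the scheme is "bucket/key", which maps directly
    # onto the path-style HTTP endpoint.
    bucket_and_key = s3n_path[len(prefix):]
    return "http://s3.amazonaws.com/" + bucket_and_key

# The path from the email ("..." left as-is, standing for a real file path):
print(s3n_to_http("s3n://big-data-benchmark/pavlo/..."))
```

You could then fetch the printed URL with any HTTP client (wget, curl, a
browser) instead of needing S3 credentials.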

One note, though: the tiny dataset doesn't seem to exist anymore. You can
check
http://s3.amazonaws.com/big-data-benchmark/
to see the available files; a Ctrl+F search for "tiny" in that listing
returned 0 matches.


Best,
Burak

----- Original Message -----
From: "Tom" <thubregt...@gmail.com>
To: u...@spark.incubator.apache.org
Sent: Tuesday, July 15, 2014 2:10:15 PM
Subject: Retrieve dataset of Big Data Benchmark

Hi,

I would like to use the dataset from the Big Data Benchmark
<https://amplab.cs.berkeley.edu/benchmark/> on my own cluster, to run some
tests between Hadoop and Spark. The dataset should be available at
s3n://big-data-benchmark/pavlo/[text|text-deflate|sequence|sequence-snappy]/[suffix]
on Amazon S3. Is there a way I can download this without being a user of
the Amazon cluster? I tried
"bin/hadoop distcp s3n://123:456@big-data-benchmark/pavlo/text/tiny/* ./"
but it asks for an AWS Access Key ID and Secret Access Key, which I do not
have.

Thanks in advance,

Tom



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Retrieve-dataset-of-Big-Data-Benchmark-tp9821.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
