Hi Tom,
If you want to load the files directly in Spark, you can use
sc.textFile("s3n://big-data-benchmark/pavlo/..."), where sc is your
SparkContext. This should work because the files are meant to be publicly
available, so you shouldn't need AWS credentials to access them.
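
For illustration, here is a minimal PySpark-style sketch of that call. The
path layout (format directory, then suffix) is taken from your original
message. The helper names benchmark_uri and load_text are made up for this
example, and the suffix is left as a parameter because I don't know which
ones currently exist. It also assumes anonymous reads over s3n work with
your Hadoop build, which I have not verified.

```python
BUCKET = "big-data-benchmark"
# Format directories as listed in the original question.
FORMATS = ("text", "text-deflate", "sequence", "sequence-snappy")


def benchmark_uri(fmt, suffix):
    """Build s3n://big-data-benchmark/pavlo/<fmt>/<suffix> (hypothetical helper)."""
    if fmt not in FORMATS:
        raise ValueError("unknown format: %r" % (fmt,))
    return "s3n://%s/pavlo/%s/%s" % (BUCKET, fmt, suffix)


def load_text(sc, suffix):
    """Load the plain-text variant as an RDD of lines.

    `sc` is expected to be an existing SparkContext; only its
    textFile() method is used here.
    """
    return sc.textFile(benchmark_uri("text", suffix))
```

With a live SparkContext you would call load_text(sc, suffix) with one of
the suffixes that actually appears in the bucket listing.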
If you want to download the files to your local drive, you can use the link
http://s3.amazonaws.com/big-data-benchmark/pavlo/...
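
If it helps, the mapping between the two address forms above can be sketched
in a few lines of Python (standard library only). The function names
s3n_to_http and download are hypothetical, and this only handles a single
object, not a wildcard like the one in your distcp command. It assumes the
object is publicly readable over plain HTTP.

```python
import urllib.request
from urllib.parse import urlsplit


def s3n_to_http(uri):
    """Turn s3n://<bucket>/<key> into http://s3.amazonaws.com/<bucket>/<key>.

    Any "id:secret@" credentials embedded in the URI are dropped,
    since the HTTP link does not use them.
    """
    parts = urlsplit(uri)
    if parts.scheme != "s3n" or not parts.hostname:
        raise ValueError("not an s3n URI: %r" % (uri,))
    return "http://s3.amazonaws.com/%s%s" % (parts.hostname, parts.path)


def download(uri, dest):
    """Fetch one object to a local file path (performs a network request)."""
    urllib.request.urlretrieve(s3n_to_http(uri), dest)
```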
One note, though: the tiny dataset doesn't seem to exist anymore. You can look
at
http://s3.amazonaws.com/big-data-benchmark/
to see the available files. Searching that listing (Ctrl+F) for "tiny" returned 0 matches.
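
The same check can be scripted. The sketch below assumes the listing URL
returns the usual S3 bucket-listing XML (Key elements in the 2006-03-01
namespace). The function names are made up for this example. Keep in mind
that S3 returns at most 1000 keys per listing response, so a single page
may not cover the whole bucket.

```python
import urllib.request
import xml.etree.ElementTree as ET

# XML namespace used by S3 bucket-listing responses.
S3_NS = "{http://s3.amazonaws.com/doc/2006-03-01/}"
LISTING_URL = "http://s3.amazonaws.com/big-data-benchmark/"


def keys_in_listing(xml_data):
    """Return every object key found in one listing response (str or bytes)."""
    root = ET.fromstring(xml_data)
    return [el.text for el in root.iter(S3_NS + "Key")]


def keys_matching(xml_data, needle):
    """Return the keys that contain `needle`, e.g. "tiny"."""
    return [key for key in keys_in_listing(xml_data) if needle in key]


def fetch_listing(url=LISTING_URL):
    """Download one page of the bucket listing (performs a network request)."""
    with urllib.request.urlopen(url) as resp:
        return resp.read()
```

For example, keys_matching(fetch_listing(), "tiny") gives the keys on the
first page that mention "tiny"; an empty list matches what I saw in the
browser.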
Best,
Burak
----- Original Message -----
From: "Tom" <[email protected]>
To: [email protected]
Sent: Tuesday, July 15, 2014 2:10:15 PM
Subject: Retrieve dataset of Big Data Benchmark
Hi,
I would like to use the dataset used in the Big Data Benchmark
<https://amplab.cs.berkeley.edu/benchmark/> on my own cluster, to run some
tests between Hadoop and Spark. The dataset should be available at
s3n://big-data-benchmark/pavlo/[text|text-deflate|sequence|sequence-snappy]/[suffix],
on the Amazon cluster. Is there a way I can download this without being a
user of the Amazon cluster? I tried
"bin/hadoop distcp s3n://123:456@big-data-benchmark/pavlo/text/tiny/* ./"
but it asks for an AWS Access Key ID and Secret Access Key which I do not
have.
Thanks in advance,
Tom