Reported yesterday: https://issues.apache.org/jira/browse/HADOOP-17784

Looks like the whole landsat-pds bucket is going, which is a shame, as we
use its 400MB scene list file whenever we want a large file to test seek
IO, a public but read-only bucket to test permissions against, etc. etc.


I've sent a note (attached) to the person who manages the bucket, asking
for at least the CSV.gz file to be retained:
https://lists.osgeo.org/pipermail/landsat-pds/2021-June/000181.html

No idea what the outcome will be there, but we should assume that it will
go requester-pays (no good until we add that support) and then disappear
entirely. For all existing releases we'll just have to document how to
declare a new CSV file, and accept that the permission and anonymous
credential test suites won't run any more:
https://hadoop.apache.org/docs/current/hadoop-aws/tools/hadoop-aws/testing.html#Configuring_the_CSV_file_read_tests.2A.2A
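
For anyone who needs to repoint a test run in the meantime, here is a
minimal sketch of the kind of override involved, assuming the
fs.s3a.scale.test.csvfile property described in the testing docs above;
the replacement bucket name is purely a placeholder, not a real dataset:

    // Minimal sketch: point the CSV read tests at a replacement file and
    // do a quick open + seek to confirm it is usable. Assumes hadoop-aws
    // is on the classpath; "some-public-bucket" is a placeholder.
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class CsvTestFileCheck {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // The property the hadoop-aws test suites read the CSV path from.
        conf.set("fs.s3a.scale.test.csvfile",
            "s3a://some-public-bucket/scene_list.gz");
        Path csv = new Path(conf.getTrimmed("fs.s3a.scale.test.csvfile"));
        try (FileSystem fs = csv.getFileSystem(conf);
             FSDataInputStream in = fs.open(csv)) {
          in.seek(1024);                      // random IO against the file
          System.out.println("byte at offset 1024 = " + in.read());
        }
      }
    }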

Meanwhile: who has some good ideas for a replacement dataset?

---------- Forwarded message ---------
From: Steve Loughran <ste...@cloudera.com>
Date: Thu, 1 Jul 2021 at 17:31
Subject: The critical role of landsat-pds/scene-list.gz file in testing
hadoop & spark S3 integration


Hello,

We developers of Apache Hadoop have long been using the landsat-pds
scene_list.gz file in a lot of the Hadoop AWS S3 integration tests, as it
has offered many benefits over the last decade:


   - Eliminates the delay and overhead of creating a multi-MB dataset on
   test runs, so permitting random IO tests against large files even over
   slow network links -- this has become even more important now that
   everyone has to work from home
   - Verifies the code can read .csv.gz data created through other
   applications
   - Because it is world/anonymous readable, but read-only, we can do
   permission and credential tests
   - With AWS funding the data reads, we can keep costs down, especially
   for open source developers without someone paying their bills.

And with our policy of "patch submitters must run all the integration
tests with their own set of credentials", cost and setup time matter:
https://hadoop.apache.org/docs/current/hadoop-aws/tools/hadoop-aws/testing.html

Here are some of the tests we've been using it for:

   - Verification of the anonymous access credential provider (see the
   sketch after this list)
   - Generation of error messages when you try to delete a file you don't
   have the right permissions for
   - Testing seek performance on different read strategies (whole file,
   random IO)
   - Testing S3 select IO
   - Verifying that clients can access us-west-2 data through the central
   S3 endpoint (a source of recent pain)
   - Some downstream-of-Spark tests using it as the source of an RDD,
   which I then use in Spark SQL calls to convert to Parquet & ORC, so
   testing the S3 committers.
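
To make the anonymous-access item above concrete, here is a minimal
sketch of that kind of check, assuming the AnonymousAWSCredentialsProvider
class that ships in hadoop-aws; it illustrates the idea rather than
reproducing the actual test code:

    // Minimal sketch: stat the scene list with no credentials at all,
    // relying on the bucket being world-readable. Not the real test code.
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class AnonymousReadCheck {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Switch to the anonymous provider: no AWS credentials are used.
        conf.set("fs.s3a.aws.credentials.provider",
            "org.apache.hadoop.fs.s3a.AnonymousAWSCredentialsProvider");
        Path csv = new Path("s3a://landsat-pds/scene_list.gz");
        try (FileSystem fs = csv.getFileSystem(conf)) {
          FileStatus st = fs.getFileStatus(csv);
          // Reads work; any write or delete attempt should fail with an
          // access-denied error, which is what the permission tests check.
          System.out.println("scene_list.gz length = " + st.getLen());
        }
      }
    }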


As you can imagine, the loss of the bucket is going to be significant,
which is why it was unwelcome news to discover that this is about to
happen.

We can (and will, ASAP) move to any new AWS bucket/dataset we can
identify that supplies a .csv.gz file of a few hundred MB, but that won't
help existing releases, all of whose test suites are going to break.

And while we've always provided an option to change the URL, this was
done for private S3 store testing: when the path != landsat-pds, the
permission, S3 Select and region test suites are disabled, on the
assumption that they'd fail against private data. Coverage is going to be
reduced.

Could you keep the scene-list file around, even if everything else is
deleted?

It would be your continuing contribution to the open source/big data
community.

-Steve
