Reported yesterday: https://issues.apache.org/jira/browse/HADOOP-17784

Looks like the whole landsat-pds bucket is going away, which is a shame, as we use its 400MB scene list file whenever we want a large file to test seek IO, a public but read-only bucket to test permissions on, etc. etc.

I've sent a note (attached) to the person who manages the bucket, asking for at least the CSV.gz file to be retained:
https://lists.osgeo.org/pipermail/landsat-pds/2021-June/000181.html

No idea what the outcome will be there, but we should assume that the bucket will go requester-pays (no good until we add that support) and then disappear entirely.

For all existing releases we'll just have to document how to declare a new CSV file, and accept that the permission and anonymous-credential test suites won't run any more (a sketch of the relevant settings is at the end of this mail):
https://hadoop.apache.org/docs/current/hadoop-aws/tools/hadoop-aws/testing.html#Configuring_the_CSV_file_read_tests.2A.2A

Meanwhile: who has some good ideas for a replacement dataset?

---------- Forwarded message ---------
From: Steve Loughran <ste...@cloudera.com>
Date: Thu, 1 Jul 2021 at 17:31
Subject: The critical role of the landsat-pds/scene_list.gz file in testing Hadoop & Spark S3 integration

Hello,

We developers of Apache Hadoop have long been using the landsat-pds scene_list.gz file in a lot of the Hadoop AWS S3 integration tests, as it has offered many benefits over the last decade:

- It eliminates the delay of creating a multi-MB dataset on each test run, permitting random IO tests against large files even over slow network links -- this has become even more important now that everyone has to work from home.
- It verifies that the code can read .csv.gz data created by other applications.
- Because it is world/anonymous-readable but read-only, we can run permission and credential tests against it.
- With AWS funding the data reads, we can keep costs down, especially for open source developers without someone paying their bills. And with our policy of "the patch submitter must run all the integration tests with their own set of credentials", cost and setup time matter:
  https://hadoop.apache.org/docs/current/hadoop-aws/tools/hadoop-aws/testing.html

Here are some of the tests we have been using it for:

- Verification of the anonymous access credential provider
- Generation of error messages when you try to delete a file you don't have the permissions for
- Testing seek performance of the different read strategies (whole file, random IO)
- Testing S3 Select IO
- Verifying that clients can access us-west-2 data through the S3 central endpoint (a source of recent pain)
- Some downstream-of-Spark tests using it as the source of an RDD which I then use in Spark SQL calls to convert to Parquet & ORC, so testing the S3 committers

As you can imagine, the loss of the bucket is going to be significant, which is why it was unwelcome news to discover that it's about to happen.

We can (and will, ASAP) move to any new AWS bucket/dataset we can identify which supplies a .csv.gz file of a few hundred MB, but that won't help existing releases, all of whose test suites are going to break. And while we've always provided an option to change the URL, this was done for private S3 store testing: when the path != landsat-pds, the permission, S3 Select and region test suites are disabled on the assumption that they'd fail against private data. Coverage is going to be reduced.

Could you keep the scene_list file around, even if everything else is deleted? It would be your continuing contribution to the open source/big data community.

-Steve
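
For anyone who wants to repoint an existing test setup at a different file in the meantime: below is a minimal sketch of the relevant hadoop-aws test properties, as I remember them from the testing docs. The bucket name and object path are placeholders, not a real replacement dataset, so do check the property names against the docs for your release.

    <!-- hadoop-tools/hadoop-aws/src/test/resources/auth-keys.xml -->
    <configuration>

      <!-- Point the CSV file read tests at an alternative public .csv.gz object.
           The bucket and path here are hypothetical placeholders;
           setting the value to a single space skips these tests instead. -->
      <property>
        <name>fs.s3a.scale.test.csvfile</name>
        <value>s3a://example-public-bucket/path/to/large-file.csv.gz</value>
      </property>

      <!-- Read the (placeholder) public bucket anonymously, via per-bucket settings -->
      <property>
        <name>fs.s3a.bucket.example-public-bucket.aws.credentials.provider</name>
        <value>org.apache.hadoop.fs.s3a.AnonymousAWSCredentialsProvider</value>
      </property>

    </configuration>

Bear in mind that, as the forwarded mail says, once the path != landsat-pds the permission, S3 Select and region test suites are disabled anyway, so this keeps the seek/IO tests running but doesn't restore full coverage.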
Looks like the whole landsat-pds bucket is going, which is a shame as we use its 400MB scene list file whenever we want a large file to test seek IO, a public but read-only bucket to test permissions on, etc. etc. I've sent a note (attached) to the person who manages the bucket, asking for at least the CSV.gz file to be retained, https://lists.osgeo.org/pipermail/landsat-pds/2021-June/000181.html No idea what the outcome will be there, but we should assume that it will go requester pays (no good until we add that support) and then disappear entirely. For all existing releases we'll just have to document how to declare a new CSV file, and accepting that permission and anonymous credential test suites won't run any more https://hadoop.apache.org/docs/current/hadoop-aws/tools/hadoop-aws/testing.html#Configuring_the_CSV_file_read_tests.2A.2A Meanwhile: who has some good ideas for a replacement dataset? ---------- Forwarded message --------- From: Steve Loughran <ste...@cloudera.com> Date: Thu, 1 Jul 2021 at 17:31 Subject: The critical role of landsat-pds/scene-list.gz file in testing hadoop & spark S3 integration Hello, We developers of Apache Hadoop have long been using it the landsat-pds scene_list.gz file in a lot of the hadoop AWS S3 integration tests as it's offered many benefits over the last decade - Eliminates the dleay overhead of creating a multi MB dataset on test runs, so permitting random IO tests against large files even over slow network links --this has become even more important now that everyone has to work from home - Verifies the code can read .csv.gz data created through other applications - Because it is world/anonymous readable, but read-only, we can do permission and credential tests - With AWS funding the data reads, we can keep costs down, especially for open source developers without someone paying their bills. And with our policy of "patch submitter must run all the integration tests with their own set of credentials", cost and setup time matters: https://hadoop.apache.org/docs/current/hadoop-aws/tools/hadoop-aws/testing.html Here are some of the test we been using it for - Verification of anonymous access credential provider - Generation of error messages when you try to delete a file you don't have a right mission for - Testing seek performance on different read strategies (whole file, random IO) - Testing S3 select IO - Verifying that clients can access us-west-2 data from S3 central call (a source of recent pain) - Some downstream-of-spark tests using it as a source of an RDD which I can then use in spark SQL calls to then convert to parquet & ORC, so testing the S3 committers. As you can imagine, the loss of the bucket is going to be significant, which is why it was unwelcome news to discover that's about to happen. we can (and will ASAP) move to any new AWS bucket/dataset we can identify supplying a .csv.gz file of a few hundred MB, but that won't help existing releases, all of whose test suites are going to break. And while we've always provided an option to change the URL, this was done for private S3 store testing..when the path != landsat-pds then permission, S3 select and region test suites are disabled on the assumption that they'd fail against private data. Coverage is going to be reduced. Could you keep the scene-list file around, even if everything else is deleted? It would be your continuing contribution to the open source/big data community. -Steve