(can't reply to user@, so pulling @dev instead. sorry)

There is no fundamental reason why the hadoop-cloud POM and artifact aren't
built/released by the ASF Spark project; I think the effort it took to get
the spark-hadoop-cloud module in at all was enough to put me off trying to
get the artifact released.

Including the AWS SDK in the Spark tarball is the main thing to question.

The module does contain some minimal binding classes to deal with two
issues, both of which are actually fixable if anyone sat down to do it.


   1. Spark using the MapReduce V1 APIs (org.apache.hadoop.mapred) vs V2
   (org.apache.hadoop.mapreduce.{input, output, ...}). That's fixable in
   Spark; a shim class was just a lot less traumatic.
   2. Parquet being fussy about the committer being a subclass of
   ParquetOutputCommitter. Again, a shim does that; the alternative is a fix
   in Parquet, or I modify the original Hadoop FileOutputCommitter to
   actually wrap/forward to a new committer. I chose not to do that from the
   outset because that class scares me. Nothing has changed my opinion
   there. FWIW, EMR just did their S3-only committer as a subclass of
   ParquetOutputCommitter. Simpler solution if you don't have to care about
   other committers for other stores.

Move Spark to the MRv2 APIs, and get the Parquet library to downgrade
gracefully if the committer isn't a subclass (it wants the option to call
writeMetaDataFile()), and the need for those shims goes away.
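
To make the first shim concrete, here is a minimal, purely hypothetical
sketch of the kind of bridge involved (this is not the actual
spark-hadoop-cloud class, just an illustration of exposing a mapreduce V2
committer through the mapred V1 API that Spark's older code paths call):

// Hypothetical sketch only, not the spark-hadoop-cloud shim: wrap a V2
// committer so V1 callers can drive it.
import org.apache.hadoop.mapred.{JobContext => V1JobContext,
  TaskAttemptContext => V1TaskContext, OutputCommitter => V1OutputCommitter}
import org.apache.hadoop.mapreduce.{OutputCommitter => V2OutputCommitter}

class V1ToV2CommitterShim(delegate: V2OutputCommitter)
    extends V1OutputCommitter {
  // mapred.JobContext/TaskAttemptContext extend their mapreduce
  // counterparts, so forwarding is mostly a matter of matching signatures.
  override def setupJob(ctx: V1JobContext): Unit = delegate.setupJob(ctx)
  override def commitJob(ctx: V1JobContext): Unit = delegate.commitJob(ctx)
  override def setupTask(ctx: V1TaskContext): Unit = delegate.setupTask(ctx)
  override def needsTaskCommit(ctx: V1TaskContext): Boolean =
    delegate.needsTaskCommit(ctx)
  override def commitTask(ctx: V1TaskContext): Unit = delegate.commitTask(ctx)
  override def abortTask(ctx: V1TaskContext): Unit = delegate.abortTask(ctx)
}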

What the module also does is import the relevant hadoop-aws, hadoop-azure
modules etc. and strip out anything which complicates life. Once it is
published to the Maven repo, apps can import it downstream and get a
consistent set of hadoop-* artifacts, plus the AWS artifacts they've been
compiled and tested with.
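
For a downstream build that looks something like the sketch below (sbt
syntax; the version is a placeholder to replace with whatever matches your
Spark build, and you'd point at whichever repository actually publishes the
artifact for your distribution):

// build.sbt sketch: pull in spark-hadoop-cloud so hadoop-aws/hadoop-azure
// and the matching SDKs arrive as one consistent, tested set.
val sparkVersion = "3.x.y"  // placeholder: match your Spark distribution
libraryDependencies +=
  "org.apache.spark" %% "spark-hadoop-cloud" % sparkVersion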

They are published by both Cloudera and Palantir; it'd be really good for
the world as a whole if the ASF published them too, in sync with the rest
of the release:

https://mvnrepository.com/artifact/org.apache.spark/spark-hadoop-cloud


There's one other aspect of the module: when it is built, the Spark
distribution includes the AWS SDK bundle, which is a few hundred MB and
growing.

"Why use the whole shaded JAR?" Classpaths. Jackson versions, httpclient
versions, etc.: if they weren't shaded it'd be very hard to get a
consistent set of dependencies. There's the side benefit of having one
consistent set of AWS libraries, so spark-kinesis will be in sync with the
s3a client, DynamoDB client, etc, etc. (
https://issues.apache.org/jira/browse/HADOOP-17197 )

There's a very good case for excluding that SDK from the distro unless you
are confident people really want it. Instead just say "this release
contains all the ASF dependencies needed to work with AWS; just add
aws-sdk-bundle 1.11.XYZ".

I'm happy to work on that if I can get some promise of review time from
others.

On related notes

Hadoop 3.3.1 RCs are up for testing. For S3A this includes everything in
https://issues.apache.org/jira/browse/HADOOP-16829: big speedups in list
calls, and you can turn off deletion of directory markers for significant
IO gains/reduced throttling. Do play ASAP and do complain about issues:
this is your last chance before things ship.
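
Switching marker deletion off from the Spark side is one s3a option,
fs.s3a.directory.marker.retention; a minimal sketch (the app name and
session wiring are just illustrative):

import org.apache.spark.sql.SparkSession

// Sketch: keep directory markers rather than deleting them on write.
// Only safe once every client touching the bucket is marker-aware (3.3.1+).
val spark = SparkSession.builder()
  .appName("s3a-marker-retention-demo")  // illustrative name
  .config("spark.hadoop.fs.s3a.directory.marker.retention", "keep")
  .getOrCreate()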

For everything else, yes, many benefits. And, courtesy of Huawei, native
ARM support too. Your VM cost/hour just went down for all workloads where
you don't need GPUs.

*The RC2 artifacts are at*:
https://home.apache.org/~weichiu/hadoop-3.3.1-RC2/
ARM artifacts: https://home.apache.org/~weichiu/hadoop-3.3.1-RC2-arm/


*The maven artifacts are hosted here:*
https://repository.apache.org/content/repositories/orgapachehadoop-1318/


Independent of that, for anyone working on Azure or GCS who wants Spark to
write output in a classic Hive partitioned directory structure: there's a
WiP committer which promises speed and correctness even when the store
(GCS) doesn't do atomic dir renames.

https://github.com/apache/hadoop/pull/2971

Reviews and testing with private datasets are strongly encouraged, and I'd
love to get the IOStatistics parts of the _SUCCESS files to see what
happened. This committer measures the time to list/rename/mkdir in task and
job commit, and aggregates them all into the final report.
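
If you want to send those along: the _SUCCESS file these committers write
is a small JSON manifest rather than an empty marker file, so dumping it as
text is enough. A rough sketch (bucket and path are illustrative):

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import scala.io.Source

// Sketch: print the JSON _SUCCESS manifest, IOStatistics included, so it
// can be attached to a review or bug report. The path is illustrative.
val successPath = new Path("gs://my-bucket/output/events/_SUCCESS")
val fs = FileSystem.get(successPath.toUri, new Configuration())
val in = fs.open(successPath)
try {
  println(Source.fromInputStream(in, "UTF-8").mkString)
} finally {
  in.close()
}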

-Steve

On Mon, 31 May 2021 at 13:35, Sean Owen <sro...@gmail.com> wrote:

> I know it's not enabled by default when the binary artifacts are built,
> but not exactly sure why it's not built separately at all. It's almost a
> dependencies-only pom artifact, but there are two source files. Steve do
> you have an angle on that?
>
> On Mon, May 31, 2021 at 5:37 AM Erik Torres <etserr...@gmail.com> wrote:
>
>> Hi,
>>
>> I'm following this documentation
>> <https://spark.apache.org/docs/latest/cloud-integration.html#installation> to
>> configure my Spark-based application to interact with Amazon S3. However, I
>> cannot find the spark-hadoop-cloud module in Maven central for the
>> non-commercial distribution of Apache Spark. From the documentation I would
>> expect that I can get this module as a Maven dependency in my project.
>> However, I ended up building the spark-hadoop-cloud module from the Spark's
>> code <https://github.com/apache/spark>.
>>
>> Is this the expected way to setup the integration with Amazon S3? I think
>> I'm missing something here.
>>
>
