[
https://issues.apache.org/jira/browse/HADOOP-19343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17937215#comment-17937215
]
Chris Nauroth commented on HADOOP-19343:
----------------------------------------
We had a productive meeting about this on 2025-03-20, attended by me,
[~arunchacko], [~mthakur], [[email protected]], with an exciting cameo from
[~lmccay]! I took notes, and I'd like to summarize here for the whole community:
* Reiterating the high-level proposal, we want to bring over a port of the
code from [the existing
repository|https://github.com/GoogleCloudDataproc/hadoop-connectors/], with
some simplifications made along the way. For example, we don't plan on
retaining the existing Maven multi-module structure, because that doesn't serve
any benefit in Apache Hadoop, and existing precedent is that the other cloud
file systems are a single module.
* Acceptance criteria include all tests passing, including file system
contract tests and execution of TPC benchmarks. We're initially targeting
performance within 15% of the existing repo, with iterative improvements
after that.
* Our end goal is 100% feature parity with the existing repo. To make the
project more manageable, we want to structure this in milestones with initial
"must have" features and additional features to be added incrementally.
* One specific example of a feature we won't target initially is [GCS
hierarchical namespace|https://cloud.google.com/storage/docs/hns-overview]
("HNS") support. We'll be able to work with HNS buckets, but the initial port
won't include optimizations implemented for HNS buckets. There was a side
discussion about how this might not be too impactful considering the general
shift toward manifest table formats like Iceberg and Hudi instead of the
traditional Hive table layout. These formats don't drive heavy rename traffic
the way the Hive layout does.
* The group agreed on targeting Apache Hadoop 3.5.0, aligned with Java 17
support. This means users who need Java 8 or 11 support wouldn't be able to use
it, but they still have access to the stable releases from the existing repo to
help with that.
* The group was generally interested in keeping a short-lived feature branch
and including it for 3.5.0 ASAP. We all want Java 17! Detailed information on
testing would really help, especially anything that goes beyond the typical
{{mvn verify}} unit + integration testing setup.
* We discussed dependency management as a potential pain point. The existing
GCSFS repo has taken the strategy of heavily shading dependencies, especially
protobuf, gRPC and Guava. Apache Hadoop has taken the approach of a shared
bundle of shaded stuff via the hadoop-thirdparty repo. We left this as an open
question to be settled later.
* There was some discussion of the new GCS [client-side credential access
boundary|https://cloud.google.com/iam/docs/downscoping-short-lived-credentials#client-side-token-exchange]
feature and how it relates to
[PR#587|https://github.com/GoogleCloudDataproc/hadoop-connectors/pull/587/files].
This PR provides hooks for the file system's token provider to express access
boundaries. You can use these hooks to implement a plugin integration with STS,
and that integration could use client-side exchange. However, the existing repo
does not contain a full end-to-end STS integration, and we don't have that in
current scope for transfer to ASF.
* Sidebar: I got voluntold to be the 3.5.0 release manager. :-D I was planning
on volunteering anyway, so I'm happy to help.
Next steps:
* Chris will cut a feature branch for HADOOP-19343.
* Arun will share an updated doc with more details on the development plan,
including the intended roadmap of which features to add and when.
> Add native support for GCS connector
> ------------------------------------
>
> Key: HADOOP-19343
> URL: https://issues.apache.org/jira/browse/HADOOP-19343
> Project: Hadoop Common
> Issue Type: Improvement
> Components: fs
> Affects Versions: 3.5.0
> Reporter: Abhishek Modi
> Assignee: Arunkumar Chacko
> Priority: Major
> Attachments: GCS connector for Hadoop.pdf
>
>
--
This message was sent by Atlassian Jira
(v8.20.10#820010)