Re: [DISCUSS] Integration of Volcano Engine TOS in Hadoop.

Jinglun Tue, 11 Mar 2025 00:07:25 -0700

**Update**

Hello everyone, thanks a lot for your attention. I'm happy to share the 
progress here.


1. The Kotlin dependency has been removed. The SDK dependency has been 
switched from ve-tos-java-sdk to ve-tos-java-sdk-hadoop 
(https://github.com/volcengine/ve-tos-java-sdk/tree/ve-tos-java-sdk-hadoop), 
which is built on Apache HttpClient. The dependencies are as follows.
```
[INFO] +- com.volcengine:ve-tos-java-sdk-hadoop:jar:2.8.9:compile
[INFO] |  \- org.apache.httpcomponents.client5:httpclient5:jar:5.3:compile
[INFO] |     +- org.apache.httpcomponents.core5:httpcore5:jar:5.2.4:compile
[INFO] |     \- org.apache.httpcomponents.core5:httpcore5-h2:jar:5.2.4:compile
```
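
To double-check that the Kotlin and okhttp artifacts are really gone, the 
tree output can be scanned for the old group IDs. The snippet below is only a 
sketch: check_banned_deps is a hypothetical helper name, and the banned list 
is an assumption taken from the earlier ve-tos-java-sdk tree quoted further 
down in this thread.

```shell
#!/bin/sh
# Hypothetical helper: reads `mvn dependency:tree` output on stdin and
# fails if any line mentions a group ID that is expected to be gone.
# The banned list is an assumption based on the old SDK's tree.
check_banned_deps() {
  _banned='org\.jetbrains\.kotlin|com\.squareup\.okhttp3|com\.squareup\.okio'
  # grep prints the offending lines and returns 0 when it finds a match.
  if grep -E "$_banned"; then
    echo "banned dependency found" >&2
    return 1
  fi
  echo "no banned dependencies"
}
```

It could be used as `mvn dependency:tree -pl org.apache.hadoop:hadoop-tos | 
check_banned_deps`, run from the hadoop project root.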

2. Unit tests have been migrated to JUnit 5, with one exception: the contract 
tests. They extend the abstract contract tests, which still use JUnit 4, so 
they have to stay on JUnit 4 for now. Once the abstract contract tests are 
migrated, the whole hadoop-tos module can move to JUnit 5.


The hadoop-tos implementation can be found at: 
https://github.com/apache/hadoop/pull/7194.
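
For anyone who wants to run the module's tests against the PR, the six 
environment variables listed in the 2025/02/14 message quoted below have to 
be set first. A small preflight check can catch a missing one before Maven 
starts. This is a minimal sketch, and check_tos_env is a hypothetical 
function name, not something that ships with hadoop-tos.

```shell
#!/bin/sh
# Hypothetical preflight: verify that the six variables from the test
# instructions are set and non-empty. Prints each missing name to stderr
# and returns 1 if any is missing, 0 otherwise.
check_tos_env() {
  _missing=0
  for _name in TOS_ACCESS_KEY_ID TOS_SECRET_ACCESS_KEY TOS_ENDPOINT \
               FILE_STORAGE_ROOT TOS_BUCKET TOS_UNIT_TEST_ENABLED; do
    # Indirect lookup: copy the value of the variable named by $_name.
    eval "_val=\${$_name:-}"
    if [ -z "$_val" ]; then
      echo "missing: $_name" >&2
      _missing=1
    fi
  done
  return "$_missing"
}
```

A possible use is `check_tos_env && mvn -Dtest='org.apache.hadoop.fs.tosfs.**' 
test -pl org.apache.hadoop:hadoop-tos`, where the pattern is quoted so the 
shell does not try to expand it.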

On 2025/02/18 12:19:11 Jinglun wrote:
> **Update**
> 
> I talked with xiang([email protected]), the main TOS SDK developer, and he 
> is happy to provide a new SDK that depends on Apache HttpClient. The new 
> SDK is expected to be released on 3/15. 
> 
> Thanks xiang for your help. 
> 
> If there are any other questions, please feel free to comment.
> 
> On 2025/02/17 03:27:15 Jinglun wrote:
> > Thanks PJ Fanning and slfan for your suggestions !
> > 
> > > Would it be possible to consider using a lightweight HTTP client instead?
> > Thanks for reminding me, this makes sense to me. I'll try to solve it as 
> > soon as possible.
> > 
> > > We are upgrading to Junit5 and need TOS's Unit Tests to be developed in 
> > > the Junit5 way,
> > Thanks for your nice suggestion. Let me fix this.
> > 
> > I will update as soon as possible. If you have any other questions, please 
> > feel free to comment.
> > 
> > On 2025/02/15 02:59:02 slfan1989 wrote:
> > > Thanks to Jinglun for initiating the discussion on TOS.
> > > 
> > > +1 from my personal perspective.
> > > 
> > > However, considering that we are upgrading to Junit5 and need TOS's Unit
> > > Tests to be developed in the Junit5 way, we can discuss it together under
> > > the relevant PR.
> > > 
> > > Regarding PJ Fanning's suggestion, I think we should pay attention to it.
> > > He has a deeper insight into this part.
> > > 
> > > Best Regards,
> > > - Shilun Fan
> > > 
> > > On Sat, Feb 14, 2025 at 20:58 PM PJ Fanning <[email protected]> wrote:
> > > 
> > > >
> > > > Just one thing to note is that we recently removed or reduced the
> > > > okhttp3 dependency in Hadoop because the kotlin dependency brings in
> > > > big jars and more complicated management of transitive dependencies.
> > > > Would it be possible to consider using a lightweight HTTP client
> > > > instead? The built-in Java client or Apache HttpClient are examples.
> > > >
> > > > https://issues.apache.org/jira/browse/HADOOP-18890
> > > >
> > > > On Fri, 14 Feb 2025 at 12:41, Jinglun wrote:
> > > > >
> > > > > > Thanks xiaoqiao and steve for your attention and comments. Let me
> > > > > > answer the questions about dependencies and tests.
> > > > >
> > > > > **Dependencies**
> > > > > > Hadoop-tos introduces a new dependency,
> > > > > > com.volcengine:ve-tos-java-sdk:2.8.6. It is an open source project
> > > > > > under the Apache 2.0 license (
> > > > > > https://github.com/volcengine/ve-tos-java-sdk/blob/main/LICENSE).
> > > > > >
> > > > > > Here are the dependencies pulled in by
> > > > > > com.volcengine:ve-tos-java-sdk:2.8.6. They (okhttp, okio, kotlin,
> > > > > > jackson) are open source under Apache 2.0 too.
> > > > > > [INFO] +- com.volcengine:ve-tos-java-sdk:jar:2.8.7:compile
> > > > > > [INFO] | +- com.squareup.okhttp3:okhttp:jar:4.10.0:compile
> > > > > > [INFO] | | +- com.squareup.okio:okio-jvm:jar:3.0.0:compile
> > > > > > [INFO] | | | \- org.jetbrains.kotlin:kotlin-stdlib-jdk8:jar:1.6.20:test
> > > > > > [INFO] | | | \- org.jetbrains.kotlin:kotlin-stdlib-jdk7:jar:1.6.20:test
> > > > > > [INFO] | | \- org.jetbrains.kotlin:kotlin-stdlib:jar:1.6.20:compile
> > > > > > [INFO] | | \- org.jetbrains:annotations:jar:13.0:compile
> > > > > > [INFO] | \- com.fasterxml.jackson.core:jackson-annotations:jar:2.12.7:compile
> > > > > > [INFO] +- org.jetbrains.kotlin:kotlin-stdlib-common:jar:1.6.20:compile
> > > > >
> > > > > **How is it tested**
> > > > > > The hadoop-tos module has a complete unit test set, including the
> > > > > > contract tests and extended test cases. To run it, we need a
> > > > > > machine that can connect to TOS. First set the 6 environment
> > > > > > variables below.
> > > > > ```
> > > > > export TOS_ACCESS_KEY_ID={YOUR_ACCESS_KEY}
> > > > > export TOS_SECRET_ACCESS_KEY={YOUR_SECRET_ACCESS_KEY}
> > > > > export TOS_ENDPOINT={TOS_SERVICE_ENDPOINT}
> > > > > export FILE_STORAGE_ROOT=/tmp/local_dev/
> > > > > export TOS_BUCKET={YOUR_BUCKET_NAME}
> > > > > export TOS_UNIT_TEST_ENABLED=true
> > > > > ```
> > > > > > Then cd to the hadoop project root directory and run the test
> > > > > > command below.
> > > > > > ```
> > > > > > mvn -Dtest=org.apache.hadoop.fs.tosfs.** test -pl org.apache.hadoop:hadoop-tos
> > > > > > ```
> > > > > > I also tested it in a real Hadoop environment. The document
> > > > > > (index.md) describes how to set up the jars and configure keys.
> > > > > > Common tests include: shell commands, Terasort, DFSIO, NNBench,
> > > > > > Distcp, etc.
> > > > >
> > > > > **Test Environment**
> > > > > > We need a VolcanoEngine account to run all the test cases. I can
> > > > > > provide an environment for testing. Please let me know if you need
> > > > > > to test hadoop-tos ([email protected]).
> > > > >
> > > > >
> > > > >
> > > > >
> > > > > On 2025/02/13 18:21:57 Steve Loughran wrote:
> > > > > > Sounds good, though expect no commitment from me to review anything.
> > > > > > My main concerns are about dependency libraries (what are they?) and
> > > > > > testing.
> > > > > >
> > > > > > On Tue, 11 Feb 2025 at 05:10, Xiaoqiao He wrote:
> > > > > >
> > > > > > > > Thanks Jinglun for your work. Basically +1 from me to bring it
> > > > > > > > into the Hadoop codebase.
> > > > > > > > a. After a quick review of the JIRA and PR, I think it is solid,
> > > > > > > > including documentation and code style.
> > > > > > > > b. The contributors involved here are diverse, coming from
> > > > > > > > different projects and companies, and active enough.
> > > > > > > > c. I have communicated with Jinglun offline many times, and IMO
> > > > > > > > he could be responsible for reviewing and testing this module.
> > > > > > > > Besides that, I just suggest following the Hadoop guidelines[1]
> > > > > > > > when developing the new features.
> > > > > > >
> > > > > > > > @Steve Loughran @Shilun Fan left some comments, including some
> > > > > > > > concerns, in JIRA. Would you mind giving more suggestions for
> > > > > > > > this discussion?
> > > > > > > Thanks.
> > > > > > >
> > > > > > > Best Regards,
> > > > > > > - He Xiaoqiao
> > > > > > >
> > > > > > > [1] https://hadoop.apache.org/bylaws.html
> > > > > > >
> > > > > > >
> > > > > > > On Sun, Jan 26, 2025 at 3:39 PM jinglun wrote:
> > > > > > >
> > > > > > > >> Hello everyone, I'd like to discuss the integration of Volcano
> > > > > > > >> Engine TOS in Hadoop.
> > > > > > >>
> > > > > > >>
> > > > > > > >> Volcano Engine is a fast growing cloud vendor launched by
> > > > > > > >> ByteDance, and TOS is the object storage service of Volcano
> > > > > > > >> Engine. A common pattern is to store data in TOS and run
> > > > > > > >> Hadoop/Spark/Flink applications that access TOS. But there is
> > > > > > > >> no native support for TOS in Hadoop, so it is not easy for
> > > > > > > >> users to build their Big Data System based on TOS.
> > > > > > >>
> > > > > > > >> My proposal is to integrate TOS with Hadoop to help users run
> > > > > > > >> their applications on TOS. Users only need to do some simple
> > > > > > > >> configuration, then their applications can read/write TOS
> > > > > > > >> without any code change. This work is similar to AWS S3,
> > > > > > > >> AzureBlob, AliyunOSS, Tencent COS and HuaweiCloud Object
> > > > > > > >> Storage in Hadoop.
> > > > > > >>
> > > > > > >>
> > > > > > >> More details could be found at
> > > > > > >> https://issues.apache.org/jira/browse/HADOOP-19236.
> > > > > > >>
> > > > > > >>
> > > > > > >> 1. What is the progress of the work now?
> > > > > > > >> The work is complete on branch HADOOP_19236. It was developed
> > > > > > > >> by the EMR team of Volcano Engine and has served many users,
> > > > > > > >> both in the cloud and in IDCs, for more than 2 years.
> > > > > > >>
> > > > > > >>
> > > > > > >> 2. How is the long-term maintenance and testing guaranteed?
> > > > > > > >> The contributors are open-source friendly, including
> > > > > > > >> ZhengHu(PMC of HBase and Iceberg), Jinglun(Committer of
> > > > > > > >> Hadoop), SunXin(Committer of HBase), XianyinXin(Contributor of
> > > > > > > >> Spark), Rascal Wu(Contributor of Flink), FangBo(Contributor of
> > > > > > > >> Hive) and Yuanzhihuan. We will all be involved in the
> > > > > > > >> long-term maintenance of this work. As time goes by, more
> > > > > > > >> people from the EMR team and the hadoop-tos users may join
> > > > > > > >> this work. So I'm confident about the long-term maintenance
> > > > > > > >> and testing.
> > > > > > >>
> > > > > > >>
> > > > > > > >> 3. Why should hadoop-tos be integrated into the Hadoop
> > > > > > > >> codebase? Shall we use an independent project?
> > > > > > > >> Integration is for a better user experience. First, users
> > > > > > > >> don't need to go to another repo to find the TOS support.
> > > > > > > >> Second, users don't need to worry about the version mapping
> > > > > > > >> between hadoop and hadoop-tos. Finally, a connector provided
> > > > > > > >> by the Hadoop community is more reliable and trustworthy.
> > > > > > >>
> > > > > > >>
> > > > > > >>
> > > > > > >>
> > > > > > >>
> > > > > > >>
> > > > > > >>
> > > > > > >>
> > > > > > > >> If you have any questions, concerns or anything else that is
> > > > > > > >> unclear, please let me know. Sincerely looking forward to your
> > > > > > > >> reply, thanks very much.
> > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > > > ---------------------------------------------------------------------
> > > > > To unsubscribe, e-mail: [email protected]
> > > > > For additional commands, e-mail: [email protected]
> > > > >
> > > >
> > > >
> > > >
> > > 
> > 
> > 
> > 
> 
> 
> 

