Re: [DISCUSS] Integration of Volcano Engine TOS in Hadoop.

Jinglun Fri, 14 Mar 2025 01:53:44 -0700

Thanks xiaoqiao for the nice suggestion. I agree to start a formal VOTE.

About the Jenkins report: the PR is too big (over 20000 lines) for Jenkins to 
apply the patch (https://github.com/apache/hadoop/pull/7194).


My plan is to split it into two pull requests. The first PR includes the core 
implementation and the second PR includes the unit tests.

The first would be: https://github.com/apache/hadoop/pull/7504.

On 2025/03/14 03:43:34 Xiaoqiao He wrote:
> Thanks Jinglun for your work. LGTM. +1 from my side.
> 
> BTW, please check the Jenkins report; it is better to get a +1 from Jenkins
> before checking in.
> 
> cc @PJ Fanning @Steve Loughran @slfan1989 would you mind taking another look?
> 
> If there are no other comments or concerns, I suggest launching another
> formal VOTE thread.
> 
> Best Regards,
> - He Xiaoqiao
> 
> On Thu, Mar 13, 2025 at 11:55 AM jinglun <jinglun...@qq.com> wrote:
> >
> > Hello everyone, sorry to bother you and thanks for your attention. Now that 
> > all the concerns have been addressed, I think it's time to continue the 
> > discussion about bringing hadoop-tos into the Hadoop codebase. Please let 
> > me know your thoughts, thanks.
> >
> > Cc hexiaoq...@apache.org, fannin...@apache.org, ste...@cloudera.com, 
> > slfan1...@foxmail.com.
> >
> > On 2025/03/11 07:06:36 Jinglun wrote:
> > > **Update**
> > >
> > > Hello everyone, thanks a lot for your attention. I'm happy to share the 
> > > progress here.
> > >
> > > > 1. Kotlin is removed. The library ve-tos-java-sdk has been replaced by 
> > > > ve-tos-java-sdk-hadoop 
> > > > (https://github.com/volcengine/ve-tos-java-sdk/tree/ve-tos-java-sdk-hadoop),
> > > > which is built on Apache HttpClient. The dependencies are as 
> > > > follows.
> > > ```
> > > > [INFO] +- com.volcengine:ve-tos-java-sdk-hadoop:jar:2.8.9:compile
> > > > [INFO] |  \- org.apache.httpcomponents.client5:httpclient5:jar:5.3:compile
> > > > [INFO] |     +- org.apache.httpcomponents.core5:httpcore5:jar:5.2.4:compile
> > > > [INFO] |     \- org.apache.httpcomponents.core5:httpcore5-h2:jar:5.2.4:compile
> > > ```
> > >
> > > > 2. Unit tests have been updated to JUnit 5. The one exception is the 
> > > > contract tests: they depend on the abstract contract tests, which 
> > > > still use JUnit 4, so they need to use JUnit 4 too. After the abstract 
> > > > contracts are updated, the whole hadoop-tos module can be changed to 
> > > > JUnit 5.
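
The listing in item 1 looks like `mvn dependency:tree` output. Below is a minimal sketch of how such output might be checked for leftover okhttp, okio, or Kotlin artifacts. The function name `no_kotlin_deps` and the grep pattern are illustrative assumptions, not part of hadoop-tos or the PR.

```shell
# Hypothetical helper (an assumption for illustration, not part of hadoop-tos).
# Reads dependency-tree text on stdin, prints any okhttp/okio/Kotlin lines,
# and succeeds only when none are found.
no_kotlin_deps() {
  ! grep -E 'org\.jetbrains\.kotlin|com\.squareup\.(okhttp3|okio)'
}

# Possible usage from the Hadoop project root (module coordinates taken from
# the test command quoted later in this thread):
#   mvn dependency:tree -pl org.apache.hadoop:hadoop-tos | no_kotlin_deps
```

A non-zero exit status means at least one matching artifact is still in the tree, and the offending lines are printed for inspection.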
> > >
> > >
> > > > The hadoop-tos implementation can be found at: 
> > > > https://github.com/apache/hadoop/pull/7194.
> > >
> > > On 2025/02/18 12:19:11 Jinglun wrote:
> > > > **Update**
> > > >
> > > > > I discussed this with the main TOS SDK developer, xiang 
> > > > > (evansxi...@126.com). He is happy to provide a new SDK that depends 
> > > > > on Apache HttpClient. The new SDK is expected to be released on 3/15.
> > > >
> > > > Thanks xiang for your help.
> > > >
> > > > If there are any other questions, please feel free to comment.
> > > >
> > > > On 2025/02/17 03:27:15 Jinglun wrote:
> > > > > Thanks PJ Fanning and slfan for your suggestions !
> > > > >
> > > > > > Would it be possible to consider using a lightweight HTTP client 
> > > > > > instead?
> > > > > Thanks for reminding me, this makes sense to me. I'll try to solve it 
> > > > > as soon as possible.
> > > > >
> > > > > > We are upgrading to Junit5 and need TOS's Unit Tests to be 
> > > > > > developed in the Junit5 way,
> > > > > Thanks for your nice suggestion. Let me fix this.
> > > > >
> > > > > I will update as soon as possible. If you have any other questions, 
> > > > > please feel free to comment.
> > > > >
> > > > > On 2025/02/15 02:59:02 slfan1989 wrote:
> > > > > > Thanks to Jinglun for initiating the discussion on TOS.
> > > > > >
> > > > > > +1 from my personal perspective.
> > > > > >
> > > > > > However, considering that we are upgrading to Junit5 and need TOS's 
> > > > > > Unit
> > > > > > Tests to be developed in the Junit5 way, we can discuss it together 
> > > > > > under
> > > > > > the relevant PR.
> > > > > >
> > > > > > Regarding PJ Fanning's suggestion, I think we should pay attention 
> > > > > > to it.
> > > > > > He has a deeper insight into this part.
> > > > > >
> > > > > > Best Regards,
> > > > > > - Shilun Fan
> > > > > >
> > > > > > > On Fri, Feb 14, 2025 at 8:58 PM PJ Fanning <fannin...@apache.org> 
> > > > > > > wrote:
> > > > > >
> > > > > > >
> > > > > > > Just one thing to note is that we recently removed or reduced the
> > > > > > > okhttp3 dependency in Hadoop because the kotlin dependency brings 
> > > > > > > in
> > > > > > > big jars and more complicated management of transitive 
> > > > > > > dependencies.
> > > > > > > Would it be possible to consider using a lightweight HTTP client
> > > > > > > instead? The built-in Java client or Apache HttpClient are 
> > > > > > > examples.
> > > > > > >
> > > > > > > https://issues.apache.org/jira/browse/HADOOP-18890
> > > > > > >
> > > > > > > On Fri, 14 Feb 2025 at 12:41, Jinglun wrote:
> > > > > > > >
> > > > > > > > > Thanks xiaoqiao and steve for your attention and comments. Let 
> > > > > > > > > me address the dependencies and tests.
> > > > > > > >
> > > > > > > > **Dependencies**
> > > > > > > > > Hadoop-tos introduces a new dependency,
> > > > > > > > > com.volcengine:ve-tos-java-sdk:2.8.6. It is an open source
> > > > > > > > > project under the Apache 2.0 license (
> > > > > > > > > https://github.com/volcengine/ve-tos-java-sdk/blob/main/LICENSE).
> > > > > > > > >
> > > > > > > > > Here are the dependencies pulled in by
> > > > > > > > > com.volcengine:ve-tos-java-sdk:2.8.6. They (okhttp, okio, kotlin,
> > > > > > > > > jackson) are open source under Apache 2.0 too.
> > > > > > > > [INFO] +- com.volcengine:ve-tos-java-sdk:jar:2.8.7:compile
> > > > > > > > [INFO] | +- com.squareup.okhttp3:okhttp:jar:4.10.0:compile
> > > > > > > > [INFO] | | +- com.squareup.okio:okio-jvm:jar:3.0.0:compile
> > > > > > > > > [INFO] | | | \- org.jetbrains.kotlin:kotlin-stdlib-jdk8:jar:1.6.20:test
> > > > > > > > > [INFO] | | | \- org.jetbrains.kotlin:kotlin-stdlib-jdk7:jar:1.6.20:test
> > > > > > > > > [INFO] | | \- org.jetbrains.kotlin:kotlin-stdlib:jar:1.6.20:compile
> > > > > > > > > [INFO] | | \- org.jetbrains:annotations:jar:13.0:compile
> > > > > > > > > [INFO] | \- com.fasterxml.jackson.core:jackson-annotations:jar:2.12.7:compile
> > > > > > > > > [INFO] +- org.jetbrains.kotlin:kotlin-stdlib-common:jar:1.6.20:compile
> > > > > > > >
> > > > > > > > **How is it tested**
> > > > > > > > > The hadoop-tos module has a complete unit test set, including 
> > > > > > > > > the contract tests and extended test cases. To run it, we need 
> > > > > > > > > a machine that can connect to TOS. Set the 6 environment 
> > > > > > > > > variables below.
> > > > > > > > ```
> > > > > > > > export TOS_ACCESS_KEY_ID={YOUR_ACCESS_KEY}
> > > > > > > > export TOS_SECRET_ACCESS_KEY={YOUR_SECRET_ACCESS_KEY}
> > > > > > > > export TOS_ENDPOINT={TOS_SERVICE_ENDPOINT}
> > > > > > > > export FILE_STORAGE_ROOT=/tmp/local_dev/
> > > > > > > > export TOS_BUCKET={YOUR_BUCKET_NAME}
> > > > > > > > export TOS_UNIT_TEST_ENABLED=true
> > > > > > > > ```
> > > > > > > > > Then cd to the Hadoop project root directory and run the test 
> > > > > > > > > command below.
> > > > > > > > ```
> > > > > > > > > mvn -Dtest=org.apache.hadoop.fs.tosfs.** test -pl org.apache.hadoop:hadoop-tos
> > > > > > > > ```
> > > > > > > > > I also tested it in a real Hadoop environment. The document 
> > > > > > > > > (index.md) describes how to set up the jars and configure the 
> > > > > > > > > keys. Common tests include: shell commands, Terasort, DFSIO, 
> > > > > > > > > NNBench, Distcp, etc.
> > > > > > > >
> > > > > > > > **Test Environment**
> > > > > > > > > We need a Volcano Engine account to run all the test cases. I 
> > > > > > > > > can provide an environment for testing. Please let me know if 
> > > > > > > > > you need to test hadoop-tos (jing...@apache.org).
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > > On 2025/02/13 18:21:57 Steve Loughran wrote:
> > > > > > > > > Sounds good, though expect no commitment from me to review 
> > > > > > > > > anything.
> > > > > > > > > My main concerns are about dependency libraries (what are 
> > > > > > > > > they?) and
> > > > > > > > > testing.
> > > > > > > > >
> > > > > > > > > On Tue, 11 Feb 2025 at 05:10, Xiaoqiao He wrote:
> > > > > > > > >
> > > > > > > > > > > Thanks Jinglun for your work. Basically +1 from me to 
> > > > > > > > > > > bring it into the Hadoop codebase.
> > > > > > > > > > > a. After a quick review of the JIRA and PR, I think it is 
> > > > > > > > > > > solid, including documentation and code style.
> > > > > > > > > > > b. The contributors involved are diverse, coming from 
> > > > > > > > > > > different projects and companies, and are active enough.
> > > > > > > > > > > c. I have communicated with Jinglun offline many times, and 
> > > > > > > > > > > IMO he can be responsible for reviewing and testing this 
> > > > > > > > > > > module.
> > > > > > > > > > > Besides that, I just suggest following the Hadoop 
> > > > > > > > > > > guidelines[1] when developing the new features.
> > > > > > > > > >
> > > > > > > > > > > @Steve Loughran @Shilun Fan left
> > > > > > > > > > > some comments, including some concerns, in JIRA. Would you 
> > > > > > > > > > > mind giving more suggestions for this discussion?
> > > > > > > > > > > Thanks.
> > > > > > > > > >
> > > > > > > > > > Best Regards,
> > > > > > > > > > - He Xiaoqiao
> > > > > > > > > >
> > > > > > > > > > [1] https://hadoop.apache.org/bylaws.html
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > On Sun, Jan 26, 2025 at 3:39 PM jinglun wrote:
> > > > > > > > > >
> > > > > > > > > > >> Hello everyone, I'd like to discuss the integration of 
> > > > > > > > > > >> Volcano Engine TOS in Hadoop.
> > > > > > > > > >>
> > > > > > > > > >>
> > > > > > > > > > >> Volcano Engine is a fast growing cloud vendor launched by
> > > > > > > > > > >> ByteDance, and TOS is the object storage service of Volcano
> > > > > > > > > > >> Engine. A common pattern is to store data in TOS and run
> > > > > > > > > > >> Hadoop/Spark/Flink applications that access TOS. But there
> > > > > > > > > > >> is no native support for TOS in Hadoop, so it is not easy
> > > > > > > > > > >> for users to build their big data systems on TOS.
> > > > > > > > > >>
> > > > > > > > > > >> My proposal is to integrate TOS with Hadoop to help users 
> > > > > > > > > > >> run their applications on TOS. Users only need to do some
> > > > > > > > > > >> simple configuration, then their applications can
> > > > > > > > > > >> read/write TOS without any code change. This work is
> > > > > > > > > > >> similar to the support for AWS S3, AzureBlob, AliyunOSS,
> > > > > > > > > > >> Tencent COS and HuaweiCloud Object Storage in Hadoop.
> > > > > > > > > >>
> > > > > > > > > >>
> > > > > > > > > >> More details could be found at
> > > > > > > > > >> https://issues.apache.org/jira/browse/HADOOP-19236.
> > > > > > > > > >>
> > > > > > > > > >>
> > > > > > > > > > >> 1. What is the progress of the work now?
> > > > > > > > > > >> The work is currently finished on branch HADOOP_19236. It 
> > > > > > > > > > >> was developed by the EMR team of Volcano Engine and has
> > > > > > > > > > >> served many users from both cloud and IDC for more than
> > > > > > > > > > >> 2 years.
> > > > > > > > > >>
> > > > > > > > > >>
> > > > > > > > > > >> 2. How are long-term maintenance and testing guaranteed?
> > > > > > > > > > >> The contributors are open-source friendly, including 
> > > > > > > > > > >> ZhengHu (PMC of HBase and Iceberg), Jinglun (Committer of
> > > > > > > > > > >> Hadoop), SunXin (Committer of HBase), XianyinXin
> > > > > > > > > > >> (Contributor of Spark), Rascal Wu (Contributor of Flink),
> > > > > > > > > > >> FangBo (Contributor of Hive) and Yuanzhihuan. We will all
> > > > > > > > > > >> be involved in the long-term maintenance of this work. As
> > > > > > > > > > >> time goes by, more people from the EMR team and the
> > > > > > > > > > >> hadoop-tos users may join this work.
> > > > > > > > > > >> So I'm confident in the long-term maintenance and testing.
> > > > > > > > > >>
> > > > > > > > > >>
> > > > > > > > > > >> 3. Why should hadoop-tos be integrated into the Hadoop 
> > > > > > > > > > >> codebase? Should we use an independent project instead?
> > > > > > > > > > >> Integration is for a better user experience. First, users 
> > > > > > > > > > >> don't need to go to another repo to find the TOS support.
> > > > > > > > > > >> Second, users don't need to worry about the version
> > > > > > > > > > >> mapping between hadoop and hadoop-tos. Finally, a
> > > > > > > > > > >> connector provided by the Hadoop community is more
> > > > > > > > > > >> reliable and trustworthy.
> > > > > > > > > >>
> > > > > > > > > >>
> > > > > > > > > >>
> > > > > > > > > >>
> > > > > > > > > >>
> > > > > > > > > >>
> > > > > > > > > >>
> > > > > > > > > >>
> > > > > > > > > > >> If you have any questions, concerns, or anything else that 
> > > > > > > > > > >> is unclear, please let me know. Sincerely looking forward
> > > > > > > > > > >> to your reply, thanks very much.
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > > > ---------------------------------------------------------------------
> > > > > > > > To unsubscribe, e-mail: common-dev-unsubscr...@hadoop.apache.org
> > > > > > > > For additional commands, e-mail: 
> > > > > > > > common-dev-h...@hadoop.apache.org
> > > > > > > >
> > > > > > >
> > > > > > > ---------------------------------------------------------------------
> > > > > > > To unsubscribe, e-mail: common-dev-unsubscr...@hadoop.apache.org
> > > > > > > For additional commands, e-mail: common-dev-h...@hadoop.apache.org
> > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > > > ---------------------------------------------------------------------
> > > > > To unsubscribe, e-mail: common-dev-unsubscr...@hadoop.apache.org
> > > > > For additional commands, e-mail: common-dev-h...@hadoop.apache.org
> > > > >
> > > > >
> > > >
> > > > ---------------------------------------------------------------------
> > > > To unsubscribe, e-mail: common-dev-unsubscr...@hadoop.apache.org
> > > > For additional commands, e-mail: common-dev-h...@hadoop.apache.org
> > > >
> > > >
> > >
> > > ---------------------------------------------------------------------
> > > To unsubscribe, e-mail: common-dev-unsubscr...@hadoop.apache.org
> > > For additional commands, e-mail: common-dev-h...@hadoop.apache.org
> > >
> > >
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: common-dev-unsubscr...@hadoop.apache.org
> For additional commands, e-mail: common-dev-h...@hadoop.apache.org
> 
> 

---------------------------------------------------------------------
To unsubscribe, e-mail: common-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-dev-h...@hadoop.apache.org
