jerryshao commented on code in PR #5806:
URL: https://github.com/apache/gravitino/pull/5806#discussion_r1897136150
########## bundles/aliyun-hadoop-bundle/build.gradle.kts: ########## @@ -0,0 +1,59 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. + */ +import com.github.jengelman.gradle.plugins.shadow.tasks.ShadowJar + +plugins { + `maven-publish` + id("java") + alias(libs.plugins.shadow) +} + +dependencies { + implementation(project(":bundles:aliyun-bundle")) + implementation(libs.hadoop3.oss) + implementation(libs.hadoop3.client.api) + implementation(libs.hadoop3.client.runtime) + + implementation(libs.httpclient) + implementation(libs.commons.collections3) Review Comment: Can we make all of them alphabetically ordered? ########## bundles/aliyun-bundle/build.gradle.kts: ########## @@ -47,10 +45,21 @@ dependencies { // org.apache.hadoop.fs.aliyun.oss.AliyunOSSFileSystem.initialize(AliyunOSSFileSystem.java:323) // org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:3611) implementation(libs.commons.lang3) + implementation(libs.httpclient) + implementation(libs.commons.collections3) + + implementation(libs.guava) + implementation(libs.jackson.databind) + implementation(libs.jackson.annotations) + implementation(libs.jackson.datatype.jdk8) + implementation(libs.jackson.datatype.jsr310) implementation(project(":catalogs:catalog-common")) { exclude("*") } + implementation(project(":catalogs:hadoop-common")) { + exclude("*") + } Review Comment: Can we move this module dependencies to the top of the `implementation` block. ########## catalogs/catalog-hadoop/build.gradle.kts: ########## @@ -45,25 +45,10 @@ dependencies { exclude(group = "*") } - implementation(libs.hadoop3.common) { - exclude("com.sun.jersey") - exclude("javax.servlet", "servlet-api") - exclude("org.eclipse.jetty", "*") - exclude("org.apache.hadoop", "hadoop-auth") - exclude("org.apache.curator", "curator-client") - exclude("org.apache.curator", "curator-framework") - exclude("org.apache.curator", "curator-recipes") - exclude("org.apache.avro", "avro") - exclude("com.sun.jersey", "jersey-servlet") - } - - implementation(libs.hadoop3.client) { - exclude("org.apache.hadoop", "hadoop-mapreduce-client-core") - exclude("org.apache.hadoop", "hadoop-mapreduce-client-jobclient") - exclude("org.apache.hadoop", "hadoop-yarn-api") - exclude("org.apache.hadoop", "hadoop-yarn-client") - exclude("com.squareup.okhttp", "okhttp") - } + implementation(libs.commons.lang3) + implementation(libs.commons.io) + implementation(libs.hadoop3.client.api) + implementation(libs.hadoop3.client.runtime) Review Comment: Also here. ########## docs/cloud-storage-fileset-example.md: ########## @@ -0,0 +1,681 @@ +--- +title: "How to use cloud storage fileset" +slug: /how-to-use-cloud-storage-fileset +keyword: fileset S3 GCS ADLS OSS +license: "This software is licensed under the Apache License version 2." 
+--- + +This document aims to provide a comprehensive guide on how to use cloud storage fileset created by Gravitino, it usually contains the following sections: + + +## Start up Gravitino server + +### Start up Gravitino server + +Before running the Gravitino server, you need to put the following jars into the fileset class path located in `${GRAVITINO_HOME}/catalogs/hadoop/libs`. For example, if you are using S3, you need to put gravitino-aws-hadoop-bundles-{version}.jar into the fileset class path. + + +| Storage type | Description | Jar file | Since Version | +|--------------|---------------------------------------------------------------|--------------------------------------------------------------------------------------------------------------------------|------------------| +| Local file | The local file system. | (none) | 0.5.0 | +| HDFS | HDFS file system. | (none) | 0.5.0 | +| S3 | AWS S3 storage. | [gravitino-aws-hadoop-bundle](https://mvnrepository.com/artifact/org.apache.gravitino/gravitino-hadoop-aws-bundle) | 0.8.0-incubating | +| GCS | Google Cloud Storage. | [gravitino-gcp-hadoop-bundle](https://mvnrepository.com/artifact/org.apache.gravitino/gravitino-hadoop-gcp-bundle) | 0.8.0-incubating | +| OSS | Aliyun OSS storage. | [gravitino-aliyun-hadoop-bundle](https://mvnrepository.com/artifact/org.apache.gravitino/gravitino-hadoop-aliyun-bundle) | 0.8.0-incubating | +| ABS | Azure Blob Storage (aka. ABS, or Azure Data Lake Storage (v2) | [gravitino-azure-hadoop-bundle](https://mvnrepository.com/artifact/org.apache.gravitino/gravitino-hadoop-azure-bundle) | 0.8.0-incubating | + +After putting the jars into the fileset class path, you can start up the Gravitino server by running the following command: + +```shell +cd ${GRAVITINO_HOME} +bin/gravitino.sh start +``` + +### Bundle jars + +`gravitino-{aws,gcp,aliyun,azure}-hadoop-bundle` are the jars that contain all the necessary classes to access the corresponding cloud storages, for instance, gravitino-aws-hadoop-bundle contains the all necessary classes like `hadoop-common`(hadoop-3.3.1) and `hadoop-aws` to access the S3 storage. +**They are used in the scenario where there is no hadoop environment in the runtime.** + +**If there is already hadoop environment in the runtime, you can use the `gravitino-{aws,gcp,aliyun,azure}-bundle.jar` that does not contain the cloud storage classes (like hadoop-aws) and hadoop environment, you can manually add the necessary jars to the classpath.** + +The following picture demonstrates what jars are necessary for different cloud storage filesets: Review Comment: Will you add a picture here? ########## bundles/aws-bundle/src/main/java/org/apache/gravitino/s3/fs/S3FileSystemProvider.java: ########## @@ -31,30 +36,81 @@ import org.apache.hadoop.fs.Path; import org.apache.hadoop.fs.s3a.Constants; import org.apache.hadoop.fs.s3a.S3AFileSystem; +import org.slf4j.Logger; +import org.slf4j.LoggerFactory; public class S3FileSystemProvider implements FileSystemProvider { + private static final Logger LOGGER = LoggerFactory.getLogger(S3FileSystemProvider.class); + @VisibleForTesting public static final Map<String, String> GRAVITINO_KEY_TO_S3_HADOOP_KEY = ImmutableMap.of( S3Properties.GRAVITINO_S3_ENDPOINT, Constants.ENDPOINT, S3Properties.GRAVITINO_S3_ACCESS_KEY_ID, Constants.ACCESS_KEY, S3Properties.GRAVITINO_S3_SECRET_ACCESS_KEY, Constants.SECRET_KEY); + // We can't use Constants.AWS_CREDENTIALS_PROVIDER directly, as in 2.7, this key does not exist. 
+ private static final String S3_CREDENTIAL_KEY = "fs.s3a.aws.credentials.provider"; + private static final String S3_SIMPLE_CREDENTIAL = + "org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider"; + @Override public FileSystem getFileSystem(Path path, Map<String, String> config) throws IOException { Configuration configuration = new Configuration(); Map<String, String> hadoopConfMap = FileSystemUtils.toHadoopConfigMap(config, GRAVITINO_KEY_TO_S3_HADOOP_KEY); - if (!hadoopConfMap.containsKey(Constants.AWS_CREDENTIALS_PROVIDER)) { - configuration.set( - Constants.AWS_CREDENTIALS_PROVIDER, Constants.ASSUMED_ROLE_CREDENTIALS_DEFAULT); + if (!hadoopConfMap.containsKey(S3_CREDENTIAL_KEY)) { + hadoopConfMap.put(S3_CREDENTIAL_KEY, S3_SIMPLE_CREDENTIAL); } + hadoopConfMap.forEach(configuration::set); + + // Hadoop-aws 2 does not support IAMInstanceCredentialsProvider + checkAndSetCredentialProvider(configuration); + return S3AFileSystem.newInstance(path.toUri(), configuration); } + private void checkAndSetCredentialProvider(Configuration configuration) { Review Comment: Is it related to this PR? ########## bundles/aliyun-bundle/build.gradle.kts: ########## @@ -62,6 +71,10 @@ tasks.withType(ShadowJar::class.java) { // Relocate dependencies to avoid conflicts relocate("org.jdom", "org.apache.gravitino.shaded.org.jdom") relocate("org.apache.commons.lang3", "org.apache.gravitino.shaded.org.apache.commons.lang3") + relocate("com.fasterxml.jackson", "org.apache.gravitino.shaded.com.fasterxml.jackson") + relocate("com.google.common", "org.apache.gravitino.shaded.com.google.common") + relocate("org.apache.http", "org.apache.gravitino.shaded.org.apache.http") + relocate("org.apache.commons.collections", "org.apache.gravitino.shaded.org.apache.commons.collections") Review Comment: Shall we relocate to `org.apache.gravitino.aliyun.shaded` to avoid conflict with other bundle jars? ########## settings.gradle.kts: ########## @@ -77,8 +77,12 @@ project(":spark-connector:spark-runtime-3.5").projectDir = file("spark-connector include("web:web", "web:integration-test") include("docs") include("integration-test-common") +include(":bundles:aws-hadoop-bundle") include(":bundles:aws-bundle") Review Comment: Can you group them together like `include(":bundles:aws-hadoop-bundle", ":bundles:aws-bundle")` ########## bundles/aliyun-hadoop-bundle/build.gradle.kts: ########## @@ -0,0 +1,59 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. 
+ */ +import com.github.jengelman.gradle.plugins.shadow.tasks.ShadowJar + +plugins { + `maven-publish` + id("java") + alias(libs.plugins.shadow) +} + +dependencies { + implementation(project(":bundles:aliyun-bundle")) + implementation(libs.hadoop3.oss) + implementation(libs.hadoop3.client.api) + implementation(libs.hadoop3.client.runtime) + + implementation(libs.httpclient) + implementation(libs.commons.collections3) +} + +tasks.withType(ShadowJar::class.java) { + isZip64 = true + configurations = listOf(project.configurations.runtimeClasspath.get()) + archiveClassifier.set("") + mergeServiceFiles() + + // Relocate dependencies to avoid conflicts + relocate("org.jdom", "org.apache.gravitino.shaded.org.jdom") + relocate("org.apache.commons.lang3", "org.apache.gravitino.shaded.org.apache.commons.lang3") + relocate("com.fasterxml.jackson", "org.apache.gravitino.shaded.com.fasterxml.jackson") + relocate("com.google.common", "org.apache.gravitino.shaded.com.google.common") + relocate("org.apache.http", "org.apache.gravitino.shaded.org.apache.http") + relocate("org.apache.commons.collections", "org.apache.gravitino.shaded.org.apache.commons.collections") Review Comment: Also here. I think we'd better relocate as `org.apache.gravitino.{xxx}.shaded.` to avoid conflicts if we put all the bundle jars into the same classloader. ########## clients/filesystem-hadoop3-runtime/build.gradle.kts: ########## @@ -28,6 +28,8 @@ plugins { dependencies { implementation(project(":clients:filesystem-hadoop3")) implementation(project(":clients:client-java-runtime", configuration = "shadow")) + Review Comment: Please remove unnecessary blank line here. ########## docs/cloud-storage-fileset-example.md: ########## @@ -0,0 +1,681 @@ +--- +title: "How to use cloud storage fileset" +slug: /how-to-use-cloud-storage-fileset +keyword: fileset S3 GCS ADLS OSS +license: "This software is licensed under the Apache License version 2." +--- + +This document aims to provide a comprehensive guide on how to use cloud storage fileset created by Gravitino, it usually contains the following sections: + + +## Start up Gravitino server + +### Start up Gravitino server + +Before running the Gravitino server, you need to put the following jars into the fileset class path located in `${GRAVITINO_HOME}/catalogs/hadoop/libs`. For example, if you are using S3, you need to put gravitino-aws-hadoop-bundles-{version}.jar into the fileset class path. + + +| Storage type | Description | Jar file | Since Version | +|--------------|---------------------------------------------------------------|--------------------------------------------------------------------------------------------------------------------------|------------------| +| Local file | The local file system. | (none) | 0.5.0 | +| HDFS | HDFS file system. | (none) | 0.5.0 | +| S3 | AWS S3 storage. | [gravitino-aws-hadoop-bundle](https://mvnrepository.com/artifact/org.apache.gravitino/gravitino-hadoop-aws-bundle) | 0.8.0-incubating | +| GCS | Google Cloud Storage. | [gravitino-gcp-hadoop-bundle](https://mvnrepository.com/artifact/org.apache.gravitino/gravitino-hadoop-gcp-bundle) | 0.8.0-incubating | +| OSS | Aliyun OSS storage. | [gravitino-aliyun-hadoop-bundle](https://mvnrepository.com/artifact/org.apache.gravitino/gravitino-hadoop-aliyun-bundle) | 0.8.0-incubating | +| ABS | Azure Blob Storage (aka. 
ABS, or Azure Data Lake Storage (v2) | [gravitino-azure-hadoop-bundle](https://mvnrepository.com/artifact/org.apache.gravitino/gravitino-hadoop-azure-bundle) | 0.8.0-incubating | + +After putting the jars into the fileset class path, you can start up the Gravitino server by running the following command: + +```shell +cd ${GRAVITINO_HOME} +bin/gravitino.sh start +``` + +### Bundle jars + +`gravitino-{aws,gcp,aliyun,azure}-hadoop-bundle` are the jars that contain all the necessary classes to access the corresponding cloud storages, for instance, gravitino-aws-hadoop-bundle contains the all necessary classes like `hadoop-common`(hadoop-3.3.1) and `hadoop-aws` to access the S3 storage. +**They are used in the scenario where there is no hadoop environment in the runtime.** + +**If there is already hadoop environment in the runtime, you can use the `gravitino-{aws,gcp,aliyun,azure}-bundle.jar` that does not contain the cloud storage classes (like hadoop-aws) and hadoop environment, you can manually add the necessary jars to the classpath.** + +The following picture demonstrates what jars are necessary for different cloud storage filesets: + +| Hadoop runtime version | S3 | GCS | OSS | ABS | +|------------------------|----------------------------------------------------------------------------------------------------------------------------------------|----------------------------------------------------------------------------------------------------------------|-------------------------------------------------------------------------------------------------------------------------------------------------|---------------------------------------------------------------------------------------------------------------------| +| No Hadoop environment | gravitino-aws-hadoop-bundle-{gravitino-version}.jar | gravitino-gcp-hadoop-bundle-{gravitino-version}.jar | gravitino-aliyun-hadoop-bundle-{gravitino-version}.jar | gravitino-azure-hadoop-bundle-{gravitino-version}.jar | +| 2.x, 3.x | gravitino-aws-bundle-{gravitino-version}.jar, hadoop-aws-{hadoop-version}.jar, aws-sdk-java-{version} and other necessary dependencies | gravitino-gcp-bundle-{gravitino-version}.jar, gcs-connector-{hadoop-version}.jar, other necessary dependencies | gravitino-aliyun-bundle-{gravitino-version}.jar, hadoop-aliyun-{hadoop-version}.jar, aliyun-sdk-java-{version} and other necessary dependencies | gravitino-azure-bundle-{gravitino-version}.jar, hadoop-azure-{hadoop-version}.jar, and other necessary dependencies | + +For `hadoop-aws-{version}.jar`, `hadoop-azure-{version}.jar` and `hadoop-aliyun-{version}.jar` and related dependencies, you can get it from ${HADOOP_HOME}/share/hadoop/tools/lib/ directory. +For `gcs-connector`, you can download it from the [GCS connector](https://storage.googleapis.com/hadoop-lib/gcs/gcs-connector-hadoop3-2.2.22-shaded.jar) for hadoop2 or hadoop3. + +If there still have some issues, please report it to the Gravitino community and create an issue. + +:::note +Gravitino server uses Hadoop 3.3.1, and you only need to put the corresponding bundle jars into the fileset class path. 
+:::
+
+
+## Create fileset catalogs
+
+Once the Gravitino server is started, you can create the corresponding fileset catalog with the following statements:
+
+
+### Create an S3 fileset catalog
+
+<Tabs groupId="language" queryString>
+<TabItem value="shell" label="Shell">
+
+```shell
+curl -X POST -H "Accept: application/vnd.gravitino.v1+json" \
+-H "Content-Type: application/json" -d '{
+  "name": "catalog",
+  "type": "FILESET",
+  "comment": "comment",
+  "provider": "hadoop",
+  "properties": {
+    "location": "s3a://bucket/root",
+    "s3-access-key-id": "access_key",
+    "s3-secret-access-key": "secret_key",
+    "s3-endpoint": "http://s3.ap-northeast-1.amazonaws.com",
+    "filesystem-providers": "s3"
+  }
+}' http://localhost:8090/api/metalakes/metalake/catalogs
+```
+
+</TabItem>
+<TabItem value="java" label="Java">
+
+```java
+GravitinoClient gravitinoClient = GravitinoClient
+    .builder("http://localhost:8090")
+    .withMetalake("metalake")
+    .build();
+
+Map<String, String> s3Properties = ImmutableMap.<String, String>builder()
+    .put("location", "s3a://bucket/root")
+    .put("s3-access-key-id", "access_key")
+    .put("s3-secret-access-key", "secret_key")
+    .put("s3-endpoint", "http://s3.ap-northeast-1.amazonaws.com")
+    .put("filesystem-providers", "s3")
+    .build();
+
+Catalog s3Catalog = gravitinoClient.createCatalog("catalog",
+    Type.FILESET,
+    "hadoop", // provider, Gravitino only supports "hadoop" for now.
+    "This is an S3 fileset catalog",
+    s3Properties);
+// ...
+
+```
+
+</TabItem>
+<TabItem value="python" label="Python">
+
+```python
+gravitino_client: GravitinoClient = GravitinoClient(uri="http://localhost:8090", metalake_name="metalake")
+s3_properties = {
+    "location": "s3a://bucket/root",
+    "s3-access-key-id": "access_key",
+    "s3-secret-access-key": "secret_key",
+    "s3-endpoint": "http://s3.ap-northeast-1.amazonaws.com",
+    "filesystem-providers": "s3"
+}
+
+s3_catalog = gravitino_client.create_catalog(name="catalog",
+                                             type=Catalog.Type.FILESET,
+                                             provider="hadoop",
+                                             comment="This is an S3 fileset catalog",
+                                             properties=s3_properties)
+
+```
+
+</TabItem>
+</Tabs>
+
+:::note
+The value of location should always start with `s3a` NOT `s3`, for instance, `s3a://bucket/root`. Value like `s3://bucket/root` is not supported.
+:::
+
+### Create a GCS fileset catalog
+
+<Tabs groupId="language" queryString>
+<TabItem value="shell" label="Shell">
+
+```shell
+curl -X POST -H "Accept: application/vnd.gravitino.v1+json" \
+-H "Content-Type: application/json" -d '{
+  "name": "catalog",
+  "type": "FILESET",
+  "comment": "comment",
+  "provider": "hadoop",
+  "properties": {
+    "location": "gs://bucket/root",
+    "gcs-service-account-file": "path_of_gcs_service_account_file",
+    "filesystem-providers": "gcs"
+  }
+}' http://localhost:8090/api/metalakes/metalake/catalogs
+```
+
+</TabItem>
+<TabItem value="java" label="Java">
+
+```java
+GravitinoClient gravitinoClient = GravitinoClient
+    .builder("http://localhost:8090")
+    .withMetalake("metalake")
+    .build();
+
+Map<String, String> gcsProperties = ImmutableMap.<String, String>builder()
+    .put("location", "gs://bucket/root")
+    .put("gcs-service-account-file", "path_of_gcs_service_account_file")
+    .put("filesystem-providers", "gcs")
+    .build();
+
+Catalog gcsCatalog = gravitinoClient.createCatalog("catalog",
+    Type.FILESET,
+    "hadoop", // provider, Gravitino only supports "hadoop" for now.
+    "This is a GCS fileset catalog",
+    gcsProperties);
+// ...
+
+```
+
+</TabItem>
+<TabItem value="python" label="Python">
+
+```python
+gravitino_client: GravitinoClient = GravitinoClient(uri="http://localhost:8090", metalake_name="metalake")
+
+gcs_properties = {
+    "location": "gs://bucket/root",
+    "gcs-service-account-file": "path_of_gcs_service_account_file",
+    "filesystem-providers": "gcs"
+}
+
+gcs_catalog = gravitino_client.create_catalog(name="catalog",
+                                              type=Catalog.Type.FILESET,
+                                              provider="hadoop",
+                                              comment="This is a GCS fileset catalog",
+                                              properties=gcs_properties)
+
+```
+
+</TabItem>
+</Tabs>
+
+### Create an OSS fileset catalog
+
+<Tabs groupId="language" queryString>
+<TabItem value="shell" label="Shell">
+
+```shell
+curl -X POST -H "Accept: application/vnd.gravitino.v1+json" \
+-H "Content-Type: application/json" -d '{
+  "name": "catalog",
+  "type": "FILESET",
+  "comment": "comment",
+  "provider": "hadoop",
+  "properties": {
+    "location": "oss://bucket/root",
+    "oss-access-key-id": "access_key",
+    "oss-secret-access-key": "secret_key",
+    "oss-endpoint": "http://oss-cn-hangzhou.aliyuncs.com",
+    "filesystem-providers": "oss"
+  }
+}' http://localhost:8090/api/metalakes/metalake/catalogs
+```
+
+</TabItem>
+<TabItem value="java" label="Java">
+
+```java
+GravitinoClient gravitinoClient = GravitinoClient
+    .builder("http://localhost:8090")
+    .withMetalake("metalake")
+    .build();
+
+Map<String, String> ossProperties = ImmutableMap.<String, String>builder()
+    .put("location", "oss://bucket/root")
+    .put("oss-access-key-id", "access_key")
+    .put("oss-secret-access-key", "secret_key")
+    .put("oss-endpoint", "http://oss-cn-hangzhou.aliyuncs.com")
+    .put("filesystem-providers", "oss")
+    .build();
+
+Catalog ossCatalog = gravitinoClient.createCatalog("catalog",
+    Type.FILESET,
+    "hadoop", // provider, Gravitino only supports "hadoop" for now.
+    "This is an OSS fileset catalog",
+    ossProperties);
+// ...
+
+```
+
+</TabItem>
+<TabItem value="python" label="Python">
+
+```python
+gravitino_client: GravitinoClient = GravitinoClient(uri="http://localhost:8090", metalake_name="metalake")
+oss_properties = {
+    "location": "oss://bucket/root",
+    "oss-access-key-id": "access_key",
+    "oss-secret-access-key": "secret_key",
+    "oss-endpoint": "http://oss-cn-hangzhou.aliyuncs.com",
+    "filesystem-providers": "oss"
+}
+
+oss_catalog = gravitino_client.create_catalog(name="catalog",
+                                              type=Catalog.Type.FILESET,
+                                              provider="hadoop",
+                                              comment="This is an OSS fileset catalog",
+                                              properties=oss_properties)
+
+```
+
+</TabItem>
+</Tabs>
+
+### Create an ABS (Azure Blob Storage or ADLS) fileset catalog
+
+<Tabs groupId="language" queryString>
+<TabItem value="shell" label="Shell">
+
+```shell
+curl -X POST -H "Accept: application/vnd.gravitino.v1+json" \
+-H "Content-Type: application/json" -d '{
+  "name": "catalog",
+  "type": "FILESET",
+  "comment": "comment",
+  "provider": "hadoop",
+  "properties": {
+    "location": "abfss://container/root",
+    "abs-account-name": "The account name of the Azure Blob Storage",
+    "abs-account-key": "The account key of the Azure Blob Storage",
+    "filesystem-providers": "abs"
+  }
+}' http://localhost:8090/api/metalakes/metalake/catalogs
+```
+
+</TabItem>
+<TabItem value="java" label="Java">
+
+```java
+GravitinoClient gravitinoClient = GravitinoClient
+    .builder("http://localhost:8090")
+    .withMetalake("metalake")
+    .build();
+
+Map<String, String> absProperties = ImmutableMap.<String, String>builder()
+    .put("location", "abfss://container/root")
+    .put("abs-account-name", "The account name of the Azure Blob Storage")
+    .put("abs-account-key", "The account key of the Azure Blob Storage")
+    .put("filesystem-providers", "abs")
+    .build();
+
+Catalog absCatalog = gravitinoClient.createCatalog("catalog",
+    Type.FILESET,
+    "hadoop", // provider, Gravitino only supports "hadoop" for now.
+    "This is an Azure Blob Storage fileset catalog",
+    absProperties);
+// ...
+
+```
+
+</TabItem>
+<TabItem value="python" label="Python">
+
+```python
+gravitino_client: GravitinoClient = GravitinoClient(uri="http://localhost:8090", metalake_name="metalake")
+
+abs_properties = {
+    "location": "abfss://container/root",
+    "abs-account-name": "The account name of the Azure Blob Storage",
+    "abs-account-key": "The account key of the Azure Blob Storage",
+    "filesystem-providers": "abs"
+}
+
+abs_catalog = gravitino_client.create_catalog(name="catalog",
+                                              type=Catalog.Type.FILESET,
+                                              provider="hadoop",
+                                              comment="This is an Azure Blob Storage fileset catalog",
+                                              properties=abs_properties)
+
+```
+
+</TabItem>
+</Tabs>
+
+:::note
+- The prefix of an ABS (Azure Blob Storage or ADLS (v2)) location should always start with `abfss` NOT `abfs`, for instance, `abfss://container/root`. Value like `abfs://container/root` is not supported.
+- The prefix of an AWS S3 location should always start with `s3a` NOT `s3`, for instance, `s3a://bucket/root`. Value like `s3://bucket/root` is not supported.
+- The prefix of an Aliyun OSS location should always start with `oss`, for instance, `oss://bucket/root`.
+- The prefix of a GCS location should always start with `gs`, for instance, `gs://bucket/root`.
+:::
+
+
+## Create fileset schema
+
+This part is the same for all cloud storage filesets. You can create the schema with the following statements:
+
+<Tabs groupId="language" queryString>
+<TabItem value="shell" label="Shell">
+
+```shell
+curl -X POST -H "Accept: application/vnd.gravitino.v1+json" \
+-H "Content-Type: application/json" -d '{
+  "name": "schema",
+  "comment": "comment",
+  "properties": {
+    "location": "file:///tmp/root/schema"
+  }
+}' http://localhost:8090/api/metalakes/metalake/catalogs/catalog/schemas
+```
+
+</TabItem>
+<TabItem value="java" label="Java">
+
+```java
+GravitinoClient gravitinoClient = GravitinoClient
+    .builder("http://localhost:8090")
+    .withMetalake("metalake")
+    .build();
+
+// Assuming you have just created a Hadoop catalog named `catalog`
+Catalog catalog = gravitinoClient.loadCatalog("catalog");
+
+SupportsSchemas supportsSchemas = catalog.asSchemas();
+
+Map<String, String> schemaProperties = ImmutableMap.<String, String>builder()
+    // Property "location" is optional; if specified, every managed fileset created without
+    // an explicit storage location will be stored under this location.
+    .put("location", "file:///tmp/root/schema")
+    .build();
+Schema schema = supportsSchemas.createSchema("schema",
+    "This is a schema",
+    schemaProperties
+);
+// ...
+```
+
+</TabItem>
+</Tabs>
+
+You can change the value of the `location` property according to which catalog you are using. Moreover, if the `location` property is already set in the catalog, you can omit it in the schema.
+
+## Create filesets
+
+The following statements can be used to create a fileset in the schema:
+
+<Tabs groupId="language" queryString>
+<TabItem value="shell" label="Shell">
+
+```shell
+curl -X POST -H "Accept: application/vnd.gravitino.v1+json" \
+-H "Content-Type: application/json" -d '{
+  "name": "example_fileset",
+  "comment": "This is an example fileset",
+  "type": "MANAGED",
+  "storageLocation": "file:///tmp/root/schema/example_fileset",
+  "properties": {
+    "k1": "v1"
+  }
+}' http://localhost:8090/api/metalakes/metalake/catalogs/catalog/schemas/schema/filesets
+```
+
+</TabItem>
+<TabItem value="java" label="Java">
+
+```java
+GravitinoClient gravitinoClient = GravitinoClient
+    .builder("http://localhost:8090")
+    .withMetalake("metalake")
+    .build();
+
+Catalog catalog = gravitinoClient.loadCatalog("catalog");
+FilesetCatalog filesetCatalog = catalog.asFilesetCatalog();
+
+Map<String, String> propertiesMap = ImmutableMap.<String, String>builder()
+    .put("k1", "v1")
+    .build();
+
+filesetCatalog.createFileset(
+    NameIdentifier.of("schema", "example_fileset"),
+    "This is an example fileset",
+    Fileset.Type.MANAGED,
+    "file:///tmp/root/schema/example_fileset",
+    propertiesMap);
+```
+
+</TabItem>
+<TabItem value="python" label="Python">
+
+```python
+gravitino_client: GravitinoClient = GravitinoClient(uri="http://localhost:8090", metalake_name="metalake")
+
+catalog: Catalog = gravitino_client.load_catalog(name="catalog")
+catalog.as_fileset_catalog().create_fileset(ident=NameIdentifier.of("schema", "example_fileset"),
+                                            type=Fileset.Type.MANAGED,
+                                            comment="This is an example fileset",
+                                            storage_location="file:///tmp/root/schema/example_fileset",
+                                            properties={"k1": "v1"})
+```
+
+</TabItem>
+</Tabs>
+
+Similar to the schema, the `storageLocation` is optional if you have set the `location` property in the schema or catalog. Please change the value of
+`location` to the actual location where you want to store the fileset.
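For example, because `storageLocation` is optional for a `MANAGED` fileset once the schema (or catalog) `location` is set, a fileset can be created without specifying it at all. The request below is a minimal sketch that reuses the example metalake, catalog, and schema names from this guide; the fileset name `example_fileset_2` is only illustrative, and the storage directory is expected to be resolved under the schema location shown above:

```shell
# Create a managed fileset without "storageLocation"; Gravitino should place its
# data directory under the schema location (file:///tmp/root/schema in this guide).
curl -X POST -H "Accept: application/vnd.gravitino.v1+json" \
-H "Content-Type: application/json" -d '{
  "name": "example_fileset_2",
  "comment": "Managed fileset that inherits the schema location",
  "type": "MANAGED"
}' http://localhost:8090/api/metalakes/metalake/catalogs/catalog/schemas/schema/filesets
```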
+ + +## Using Spark to access the fileset + +The following code snippet shows how to use **Spark 3.1.3 with hadoop environment** to access the fileset + +```python +import logging +from gravitino import NameIdentifier, GravitinoClient, Catalog, Fileset, GravitinoAdminClient +from pyspark.sql import SparkSession +import os + +gravitino_url = "http://localhost:8090" +metalake_name = "test" + +catalog_name = "s3_catalog" +schema_name = "schema" +fileset_name = "example" + +## this is for S3 +os.environ["PYSPARK_SUBMIT_ARGS"] = "--jars /Users/yuqi/project/gravitino/bundles/aws-bundle/build/libs/gravitino-aws-bundle-0.8.0-incubating-SNAPSHOT.jar,/Users/yuqi/project/gravitino/clients/filesystem-hadoop3-runtime/build/libs/gravitino-filesystem-hadoop3-runtime-0.8.0-incubating-SNAPSHOT.jar,/Users/yuqi/Downloads/hadoop-jars/hadoop-aws-3.2.0.jar,/Users/yuqi/Downloads/aws-java-sdk-bundle-1.11.375.jar --master local[1] pyspark-shell" +spark = SparkSession.builder + .appName("s3_fielset_test") + .config("spark.hadoop.fs.AbstractFileSystem.gvfs.impl", "org.apache.gravitino.filesystem.hadoop.Gvfs") + .config("spark.hadoop.fs.gvfs.impl", "org.apache.gravitino.filesystem.hadoop.GravitinoVirtualFileSystem") + .config("spark.hadoop.fs.gravitino.server.uri", "http://localhost:8090") + .config("spark.hadoop.fs.gravitino.client.metalake", "test") + .config("spark.hadoop.s3-access-key-id", os.environ["S3_ACCESS_KEY_ID"]) + .config("spark.hadoop.s3-secret-access-key", os.environ["S3_SECRET_ACCESS_KEY"]) + .config("spark.hadoop.s3-endpoint", "http://s3.ap-northeast-1.amazonaws.com") + .config("spark.driver.memory", "2g") + .config("spark.driver.port", "2048") + .getOrCreate() + +### this is for GCS +os.environ["PYSPARK_SUBMIT_ARGS"] = "--jars /Users/yuqi/project/gravitino/bundles/gcp-bundle/build/libs/gravitino-gcp-bundle-0.8.0-incubating-SNAPSHOT.jar,/Users/yuqi/project/gravitino/clients/filesystem-hadoop3-runtime/build/libs/gravitino-filesystem-hadoop3-runtime-0.8.0-incubating-SNAPSHOT.jar,/Users/yuqi/Downloads/hadoop-jars/gcs-connector-hadoop3-2.2.22-shaded.jar --master local[1] pyspark-shell" +spark = SparkSession.builder + .appName("s3_fielset_test") + .config("spark.hadoop.fs.AbstractFileSystem.gvfs.impl", "org.apache.gravitino.filesystem.hadoop.Gvfs") + .config("spark.hadoop.fs.gvfs.impl", "org.apache.gravitino.filesystem.hadoop.GravitinoVirtualFileSystem") + .config("spark.hadoop.fs.gravitino.server.uri", "http://localhost:8090") + .config("spark.hadoop.fs.gravitino.client.metalake", "test") + .config("spark.hadoop.gcs-service-account-file", "/Users/yuqi/Downloads/silken-physics-431108-g3-30ab3d97bb60.json") + .config("spark.driver.memory", "2g") + .config("spark.driver.port", "2048") + .getOrCreate() + +### this is for OSS +os.environ["PYSPARK_SUBMIT_ARGS"] = "--jars /Users/yuqi/project/gravitino/bundles/aliyun-bundle/build/libs/gravitino-aliyun-bundle-0.8.0-incubating-SNAPSHOT.jar,/Users/yuqi/project/gravitino/clients/filesystem-hadoop3-runtime/build/libs/gravitino-filesystem-hadoop3-runtime-0.8.0-incubating-SNAPSHOT.jar,/Users/yuqi/Downloads/hadoop-jars/aliyun-sdk-oss-2.8.3.jar,/Users/yuqi/Downloads/hadoop-jars/hadoop-aliyun-3.2.0.jar,/Users/yuqi/Downloads/hadoop-jars/jdom-1.1.jar --master local[1] pyspark-shell" +spark = SparkSession.builder + .appName("s3_fielset_test") + .config("spark.hadoop.fs.AbstractFileSystem.gvfs.impl", "org.apache.gravitino.filesystem.hadoop.Gvfs") + .config("spark.hadoop.fs.gvfs.impl", "org.apache.gravitino.filesystem.hadoop.GravitinoVirtualFileSystem") + 
.config("spark.hadoop.fs.gravitino.server.uri", "http://localhost:8090") + .config("spark.hadoop.fs.gravitino.client.metalake", "test") + .config("spark.hadoop.oss-access-key-id", os.environ["OSS_ACCESS_KEY_ID"]) + .config("spark.hadoop.oss-secret-access-key", os.environ["S3_SECRET_ACCESS_KEY"]) + .config("spark.hadoop.oss-endpoint", "https://oss-cn-shanghai.aliyuncs.com") + .config("spark.driver.memory", "2g") + .config("spark.driver.port", "2048") + .getOrCreate() +spark.sparkContext.setLogLevel("DEBUG") + +### this is for ABS +os.environ["PYSPARK_SUBMIT_ARGS"] = "--jars /Users/yuqi/project/gravitino/bundles/azure-bundle/build/libs/gravitino-azure-bundle-0.8.0-incubating-SNAPSHOT.jar,/Users/yuqi/project/gravitino/clients/filesystem-hadoop3-runtime/build/libs/gravitino-filesystem-hadoop3-runtime-0.8.0-incubating-SNAPSHOT.jar,/Users/yuqi/Downloads/hadoop-jars/hadoop-azure-3.2.0.jar,/Users/yuqi/Downloads/hadoop-jars/azure-storage-7.0.0.jar,/Users/yuqi/Downloads/hadoop-jars/wildfly-openssl-1.0.4.Final.jar --master local[1] pyspark-shell" +spark = SparkSession.builder + .appName("s3_fielset_test") + .config("spark.hadoop.fs.AbstractFileSystem.gvfs.impl", "org.apache.gravitino.filesystem.hadoop.Gvfs") + .config("spark.hadoop.fs.gvfs.impl", "org.apache.gravitino.filesystem.hadoop.GravitinoVirtualFileSystem") + .config("spark.hadoop.fs.gravitino.server.uri", "http://localhost:8090") + .config("spark.hadoop.fs.gravitino.client.metalake", "test") + .config("spark.hadoop.abs-account-name", "xiaoyu456") + .config("spark.hadoop.abs-account-key", "account_key") + .config("spark.hadoop.fs.azure.skipUserGroupMetadataDuringInitialization", "true") + .config("spark.driver.memory", "2g") + .config("spark.driver.port", "2048") + .getOrCreate() + +data = [("Alice", 25), ("Bob", 30), ("Cathy", 45)] +columns = ["Name", "Age"] +spark_df = spark.createDataFrame(data, schema=columns) +gvfs_path = f"gvfs://fileset/{catalog_name}/{schema_name}/{fileset_name}/people" + +spark_df.coalesce(1).write + .mode("overwrite") + .option("header", "true") + .csv(gvfs_path) + +``` + +If your spark has no hadoop environment, you can use the following code snippet to access the fileset: + +```python +## replace the env PYSPARK_SUBMIT_ARGS variable in the code above with the following content: +### S3 +os.environ["PYSPARK_SUBMIT_ARGS"] = "--jars /Users/yuqi/project/gravitino/bundles/aws-hadoop-bundle/build/libs/gravitino-aws-hadoop-bundle-0.8.0-incubating-SNAPSHOT.jar,/Users/yuqi/project/gravitino/clients/filesystem-hadoop3-runtime/build/libs/gravitino-filesystem-hadoop3-runtime-0.8.0-incubating-SNAPSHOT.jar --master local[1] pyspark-shell" Review Comment: Can you avoid hardcode the gravitino version number here like `0.8.0-incubating-SNAPSHOT`? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@gravitino.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org