jerryshao commented on code in PR #5806:
URL: https://github.com/apache/gravitino/pull/5806#discussion_r1897136150
########## bundles/aliyun-hadoop-bundle/build.gradle.kts: ########## @@ -0,0 +1,59 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. + */ +import com.github.jengelman.gradle.plugins.shadow.tasks.ShadowJar + +plugins { + `maven-publish` + id("java") + alias(libs.plugins.shadow) +} + +dependencies { + implementation(project(":bundles:aliyun-bundle")) + implementation(libs.hadoop3.oss) + implementation(libs.hadoop3.client.api) + implementation(libs.hadoop3.client.runtime) + + implementation(libs.httpclient) + implementation(libs.commons.collections3) Review Comment: Can we make all of them alphabetically ordered? ########## bundles/aliyun-bundle/build.gradle.kts: ########## @@ -47,10 +45,21 @@ dependencies { // org.apache.hadoop.fs.aliyun.oss.AliyunOSSFileSystem.initialize(AliyunOSSFileSystem.java:323) // org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:3611) implementation(libs.commons.lang3) + implementation(libs.httpclient) + implementation(libs.commons.collections3) + + implementation(libs.guava) + implementation(libs.jackson.databind) + implementation(libs.jackson.annotations) + implementation(libs.jackson.datatype.jdk8) + implementation(libs.jackson.datatype.jsr310) implementation(project(":catalogs:catalog-common")) { exclude("*") } + implementation(project(":catalogs:hadoop-common")) { + exclude("*") + } Review Comment: Can we move this module dependencies to the top of the `implementation` block. ########## catalogs/catalog-hadoop/build.gradle.kts: ########## @@ -45,25 +45,10 @@ dependencies { exclude(group = "*") } - implementation(libs.hadoop3.common) { - exclude("com.sun.jersey") - exclude("javax.servlet", "servlet-api") - exclude("org.eclipse.jetty", "*") - exclude("org.apache.hadoop", "hadoop-auth") - exclude("org.apache.curator", "curator-client") - exclude("org.apache.curator", "curator-framework") - exclude("org.apache.curator", "curator-recipes") - exclude("org.apache.avro", "avro") - exclude("com.sun.jersey", "jersey-servlet") - } - - implementation(libs.hadoop3.client) { - exclude("org.apache.hadoop", "hadoop-mapreduce-client-core") - exclude("org.apache.hadoop", "hadoop-mapreduce-client-jobclient") - exclude("org.apache.hadoop", "hadoop-yarn-api") - exclude("org.apache.hadoop", "hadoop-yarn-client") - exclude("com.squareup.okhttp", "okhttp") - } + implementation(libs.commons.lang3) + implementation(libs.commons.io) + implementation(libs.hadoop3.client.api) + implementation(libs.hadoop3.client.runtime) Review Comment: Also here. ########## docs/cloud-storage-fileset-example.md: ########## @@ -0,0 +1,681 @@ +--- +title: "How to use cloud storage fileset" +slug: /how-to-use-cloud-storage-fileset +keyword: fileset S3 GCS ADLS OSS +license: "This software is licensed under the Apache License version 2." 
+--- + +This document aims to provide a comprehensive guide on how to use cloud storage fileset created by Gravitino, it usually contains the following sections: + + +## Start up Gravitino server + +### Start up Gravitino server + +Before running the Gravitino server, you need to put the following jars into the fileset class path located in `${GRAVITINO_HOME}/catalogs/hadoop/libs`. For example, if you are using S3, you need to put gravitino-aws-hadoop-bundles-{version}.jar into the fileset class path. + + +| Storage type | Description | Jar file | Since Version | +|--------------|---------------------------------------------------------------|--------------------------------------------------------------------------------------------------------------------------|------------------| +| Local file | The local file system. | (none) | 0.5.0 | +| HDFS | HDFS file system. | (none) | 0.5.0 | +| S3 | AWS S3 storage. | [gravitino-aws-hadoop-bundle](https://mvnrepository.com/artifact/org.apache.gravitino/gravitino-hadoop-aws-bundle) | 0.8.0-incubating | +| GCS | Google Cloud Storage. | [gravitino-gcp-hadoop-bundle](https://mvnrepository.com/artifact/org.apache.gravitino/gravitino-hadoop-gcp-bundle) | 0.8.0-incubating | +| OSS | Aliyun OSS storage. | [gravitino-aliyun-hadoop-bundle](https://mvnrepository.com/artifact/org.apache.gravitino/gravitino-hadoop-aliyun-bundle) | 0.8.0-incubating | +| ABS | Azure Blob Storage (aka. ABS, or Azure Data Lake Storage (v2) | [gravitino-azure-hadoop-bundle](https://mvnrepository.com/artifact/org.apache.gravitino/gravitino-hadoop-azure-bundle) | 0.8.0-incubating | + +After putting the jars into the fileset class path, you can start up the Gravitino server by running the following command: + +```shell +cd ${GRAVITINO_HOME} +bin/gravitino.sh start +``` + +### Bundle jars + +`gravitino-{aws,gcp,aliyun,azure}-hadoop-bundle` are the jars that contain all the necessary classes to access the corresponding cloud storages, for instance, gravitino-aws-hadoop-bundle contains the all necessary classes like `hadoop-common`(hadoop-3.3.1) and `hadoop-aws` to access the S3 storage. +**They are used in the scenario where there is no hadoop environment in the runtime.** + +**If there is already hadoop environment in the runtime, you can use the `gravitino-{aws,gcp,aliyun,azure}-bundle.jar` that does not contain the cloud storage classes (like hadoop-aws) and hadoop environment, you can manually add the necessary jars to the classpath.** + +The following picture demonstrates what jars are necessary for different cloud storage filesets: Review Comment: Will you add a picture here? ########## bundles/aws-bundle/src/main/java/org/apache/gravitino/s3/fs/S3FileSystemProvider.java: ########## @@ -31,30 +36,81 @@ import org.apache.hadoop.fs.Path; import org.apache.hadoop.fs.s3a.Constants; import org.apache.hadoop.fs.s3a.S3AFileSystem; +import org.slf4j.Logger; +import org.slf4j.LoggerFactory; public class S3FileSystemProvider implements FileSystemProvider { + private static final Logger LOGGER = LoggerFactory.getLogger(S3FileSystemProvider.class); + @VisibleForTesting public static final Map<String, String> GRAVITINO_KEY_TO_S3_HADOOP_KEY = ImmutableMap.of( S3Properties.GRAVITINO_S3_ENDPOINT, Constants.ENDPOINT, S3Properties.GRAVITINO_S3_ACCESS_KEY_ID, Constants.ACCESS_KEY, S3Properties.GRAVITINO_S3_SECRET_ACCESS_KEY, Constants.SECRET_KEY); + // We can't use Constants.AWS_CREDENTIALS_PROVIDER directly, as in 2.7, this key does not exist. 
+ private static final String S3_CREDENTIAL_KEY = "fs.s3a.aws.credentials.provider"; + private static final String S3_SIMPLE_CREDENTIAL = + "org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider"; + @Override public FileSystem getFileSystem(Path path, Map<String, String> config) throws IOException { Configuration configuration = new Configuration(); Map<String, String> hadoopConfMap = FileSystemUtils.toHadoopConfigMap(config, GRAVITINO_KEY_TO_S3_HADOOP_KEY); - if (!hadoopConfMap.containsKey(Constants.AWS_CREDENTIALS_PROVIDER)) { - configuration.set( - Constants.AWS_CREDENTIALS_PROVIDER, Constants.ASSUMED_ROLE_CREDENTIALS_DEFAULT); + if (!hadoopConfMap.containsKey(S3_CREDENTIAL_KEY)) { + hadoopConfMap.put(S3_CREDENTIAL_KEY, S3_SIMPLE_CREDENTIAL); } + hadoopConfMap.forEach(configuration::set); + + // Hadoop-aws 2 does not support IAMInstanceCredentialsProvider + checkAndSetCredentialProvider(configuration); + return S3AFileSystem.newInstance(path.toUri(), configuration); } + private void checkAndSetCredentialProvider(Configuration configuration) { Review Comment: Is it related to this PR? ########## bundles/aliyun-bundle/build.gradle.kts: ########## @@ -62,6 +71,10 @@ tasks.withType(ShadowJar::class.java) { // Relocate dependencies to avoid conflicts relocate("org.jdom", "org.apache.gravitino.shaded.org.jdom") relocate("org.apache.commons.lang3", "org.apache.gravitino.shaded.org.apache.commons.lang3") + relocate("com.fasterxml.jackson", "org.apache.gravitino.shaded.com.fasterxml.jackson") + relocate("com.google.common", "org.apache.gravitino.shaded.com.google.common") + relocate("org.apache.http", "org.apache.gravitino.shaded.org.apache.http") + relocate("org.apache.commons.collections", "org.apache.gravitino.shaded.org.apache.commons.collections") Review Comment: Shall we relocate to `org.apache.gravitino.aliyun.shaded` to avoid conflict with other bundle jars? ########## settings.gradle.kts: ########## @@ -77,8 +77,12 @@ project(":spark-connector:spark-runtime-3.5").projectDir = file("spark-connector include("web:web", "web:integration-test") include("docs") include("integration-test-common") +include(":bundles:aws-hadoop-bundle") include(":bundles:aws-bundle") Review Comment: Can you group them together like `include(":bundles:aws-hadoop-bundle", ":bundles:aws-bundle")` ########## bundles/aliyun-hadoop-bundle/build.gradle.kts: ########## @@ -0,0 +1,59 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. 
+ */ +import com.github.jengelman.gradle.plugins.shadow.tasks.ShadowJar + +plugins { + `maven-publish` + id("java") + alias(libs.plugins.shadow) +} + +dependencies { + implementation(project(":bundles:aliyun-bundle")) + implementation(libs.hadoop3.oss) + implementation(libs.hadoop3.client.api) + implementation(libs.hadoop3.client.runtime) + + implementation(libs.httpclient) + implementation(libs.commons.collections3) +} + +tasks.withType(ShadowJar::class.java) { + isZip64 = true + configurations = listOf(project.configurations.runtimeClasspath.get()) + archiveClassifier.set("") + mergeServiceFiles() + + // Relocate dependencies to avoid conflicts + relocate("org.jdom", "org.apache.gravitino.shaded.org.jdom") + relocate("org.apache.commons.lang3", "org.apache.gravitino.shaded.org.apache.commons.lang3") + relocate("com.fasterxml.jackson", "org.apache.gravitino.shaded.com.fasterxml.jackson") + relocate("com.google.common", "org.apache.gravitino.shaded.com.google.common") + relocate("org.apache.http", "org.apache.gravitino.shaded.org.apache.http") + relocate("org.apache.commons.collections", "org.apache.gravitino.shaded.org.apache.commons.collections") Review Comment: Also here. I think we'd better relocate as `org.apache.gravitino.{xxx}.shaded.` to avoid conflicts if we put all the bundle jars into the same classloader. ########## clients/filesystem-hadoop3-runtime/build.gradle.kts: ########## @@ -28,6 +28,8 @@ plugins { dependencies { implementation(project(":clients:filesystem-hadoop3")) implementation(project(":clients:client-java-runtime", configuration = "shadow")) + Review Comment: Please remove unnecessary blank line here. ########## docs/cloud-storage-fileset-example.md: ########## @@ -0,0 +1,681 @@ +--- +title: "How to use cloud storage fileset" +slug: /how-to-use-cloud-storage-fileset +keyword: fileset S3 GCS ADLS OSS +license: "This software is licensed under the Apache License version 2." +--- + +This document aims to provide a comprehensive guide on how to use cloud storage fileset created by Gravitino, it usually contains the following sections: + + +## Start up Gravitino server + +### Start up Gravitino server + +Before running the Gravitino server, you need to put the following jars into the fileset class path located in `${GRAVITINO_HOME}/catalogs/hadoop/libs`. For example, if you are using S3, you need to put gravitino-aws-hadoop-bundles-{version}.jar into the fileset class path. + + +| Storage type | Description | Jar file | Since Version | +|--------------|---------------------------------------------------------------|--------------------------------------------------------------------------------------------------------------------------|------------------| +| Local file | The local file system. | (none) | 0.5.0 | +| HDFS | HDFS file system. | (none) | 0.5.0 | +| S3 | AWS S3 storage. | [gravitino-aws-hadoop-bundle](https://mvnrepository.com/artifact/org.apache.gravitino/gravitino-hadoop-aws-bundle) | 0.8.0-incubating | +| GCS | Google Cloud Storage. | [gravitino-gcp-hadoop-bundle](https://mvnrepository.com/artifact/org.apache.gravitino/gravitino-hadoop-gcp-bundle) | 0.8.0-incubating | +| OSS | Aliyun OSS storage. | [gravitino-aliyun-hadoop-bundle](https://mvnrepository.com/artifact/org.apache.gravitino/gravitino-hadoop-aliyun-bundle) | 0.8.0-incubating | +| ABS | Azure Blob Storage (aka. 
ABS, or Azure Data Lake Storage (v2) | [gravitino-azure-hadoop-bundle](https://mvnrepository.com/artifact/org.apache.gravitino/gravitino-hadoop-azure-bundle) | 0.8.0-incubating | + +After putting the jars into the fileset class path, you can start up the Gravitino server by running the following command: + +```shell +cd ${GRAVITINO_HOME} +bin/gravitino.sh start +``` + +### Bundle jars + +`gravitino-{aws,gcp,aliyun,azure}-hadoop-bundle` are the jars that contain all the necessary classes to access the corresponding cloud storages, for instance, gravitino-aws-hadoop-bundle contains the all necessary classes like `hadoop-common`(hadoop-3.3.1) and `hadoop-aws` to access the S3 storage. +**They are used in the scenario where there is no hadoop environment in the runtime.** + +**If there is already hadoop environment in the runtime, you can use the `gravitino-{aws,gcp,aliyun,azure}-bundle.jar` that does not contain the cloud storage classes (like hadoop-aws) and hadoop environment, you can manually add the necessary jars to the classpath.** + +The following picture demonstrates what jars are necessary for different cloud storage filesets: + +| Hadoop runtime version | S3 | GCS | OSS | ABS | +|------------------------|----------------------------------------------------------------------------------------------------------------------------------------|----------------------------------------------------------------------------------------------------------------|-------------------------------------------------------------------------------------------------------------------------------------------------|---------------------------------------------------------------------------------------------------------------------| +| No Hadoop environment | gravitino-aws-hadoop-bundle-{gravitino-version}.jar | gravitino-gcp-hadoop-bundle-{gravitino-version}.jar | gravitino-aliyun-hadoop-bundle-{gravitino-version}.jar | gravitino-azure-hadoop-bundle-{gravitino-version}.jar | +| 2.x, 3.x | gravitino-aws-bundle-{gravitino-version}.jar, hadoop-aws-{hadoop-version}.jar, aws-sdk-java-{version} and other necessary dependencies | gravitino-gcp-bundle-{gravitino-version}.jar, gcs-connector-{hadoop-version}.jar, other necessary dependencies | gravitino-aliyun-bundle-{gravitino-version}.jar, hadoop-aliyun-{hadoop-version}.jar, aliyun-sdk-java-{version} and other necessary dependencies | gravitino-azure-bundle-{gravitino-version}.jar, hadoop-azure-{hadoop-version}.jar, and other necessary dependencies | + +For `hadoop-aws-{version}.jar`, `hadoop-azure-{version}.jar` and `hadoop-aliyun-{version}.jar` and related dependencies, you can get it from ${HADOOP_HOME}/share/hadoop/tools/lib/ directory. +For `gcs-connector`, you can download it from the [GCS connector](https://storage.googleapis.com/hadoop-lib/gcs/gcs-connector-hadoop3-2.2.22-shaded.jar) for hadoop2 or hadoop3. + +If there still have some issues, please report it to the Gravitino community and create an issue. + +:::note +Gravitino server uses Hadoop 3.3.1, and you only need to put the corresponding bundle jars into the fileset class path. 
+:::
+
+
+## Create fileset catalogs
+
+Once the Gravitino server is started, you can create the corresponding fileset catalog with the following statements:
+
+
+### Create an S3 fileset catalog
+
+<Tabs groupId="language" queryString>
+<TabItem value="shell" label="Shell">
+
+```shell
+curl -X POST -H "Accept: application/vnd.gravitino.v1+json" \
+-H "Content-Type: application/json" -d '{
+  "name": "catalog",
+  "type": "FILESET",
+  "comment": "comment",
+  "provider": "hadoop",
+  "properties": {
+    "location": "s3a://bucket/root",
+    "s3-access-key-id": "access_key",
+    "s3-secret-access-key": "secret_key",
+    "s3-endpoint": "http://s3.ap-northeast-1.amazonaws.com",
+    "filesystem-providers": "s3"
+  }
+}' http://localhost:8090/api/metalakes/metalake/catalogs
+```
+
+</TabItem>
+<TabItem value="java" label="Java">
+
+```java
+GravitinoClient gravitinoClient = GravitinoClient
+    .builder("http://localhost:8090")
+    .withMetalake("metalake")
+    .build();
+
+Map<String, String> s3Properties = ImmutableMap.<String, String>builder()
+    .put("location", "s3a://bucket/root")
+    .put("s3-access-key-id", "access_key")
+    .put("s3-secret-access-key", "secret_key")
+    .put("s3-endpoint", "http://s3.ap-northeast-1.amazonaws.com")
+    .put("filesystem-providers", "s3")
+    .build();
+
+Catalog s3Catalog = gravitinoClient.createCatalog("catalog",
+    Type.FILESET,
+    "hadoop", // provider, Gravitino only supports "hadoop" for now.
+    "This is an S3 fileset catalog",
+    s3Properties);
+// ...
+
+```
+
+</TabItem>
+<TabItem value="python" label="Python">
+
+```python
+gravitino_client: GravitinoClient = GravitinoClient(uri="http://localhost:8090", metalake_name="metalake")
+s3_properties = {
+    "location": "s3a://bucket/root",
+    "s3-access-key-id": "access_key",
+    "s3-secret-access-key": "secret_key",
+    "s3-endpoint": "http://s3.ap-northeast-1.amazonaws.com",
+    "filesystem-providers": "s3"
+}
+
+s3_catalog = gravitino_client.create_catalog(name="catalog",
+                                             type=Catalog.Type.FILESET,
+                                             provider="hadoop",
+                                             comment="This is an S3 fileset catalog",
+                                             properties=s3_properties)
+
+```
+
+</TabItem>
+</Tabs>
+
+:::note
+The value of location should always start with `s3a` NOT `s3`, for instance, `s3a://bucket/root`. Value like `s3://bucket/root` is not supported.
+:::
+
+### Create a GCS fileset catalog
+
+<Tabs groupId="language" queryString>
+<TabItem value="shell" label="Shell">
+
+```shell
+curl -X POST -H "Accept: application/vnd.gravitino.v1+json" \
+-H "Content-Type: application/json" -d '{
+  "name": "catalog",
+  "type": "FILESET",
+  "comment": "comment",
+  "provider": "hadoop",
+  "properties": {
+    "location": "gs://bucket/root",
+    "gcs-service-account-file": "path_of_gcs_service_account_file",
+    "filesystem-providers": "gcs"
+  }
+}' http://localhost:8090/api/metalakes/metalake/catalogs
+```
+
+</TabItem>
+<TabItem value="java" label="Java">
+
+```java
+GravitinoClient gravitinoClient = GravitinoClient
+    .builder("http://localhost:8090")
+    .withMetalake("metalake")
+    .build();
+
+Map<String, String> gcsProperties = ImmutableMap.<String, String>builder()
+    .put("location", "gs://bucket/root")
+    .put("gcs-service-account-file", "path_of_gcs_service_account_file")
+    .put("filesystem-providers", "gcs")
+    .build();
+
+Catalog gcsCatalog = gravitinoClient.createCatalog("catalog",
+    Type.FILESET,
+    "hadoop", // provider, Gravitino only supports "hadoop" for now.
+    "This is a GCS fileset catalog",
+    gcsProperties);
+// ...
+
+```
+
+</TabItem>
+<TabItem value="python" label="Python">
+
+```python
+gravitino_client: GravitinoClient = GravitinoClient(uri="http://localhost:8090", metalake_name="metalake")
+
+gcs_properties = {
+    "location": "gs://bucket/root",
+    "gcs-service-account-file": "path_of_gcs_service_account_file",
+    "filesystem-providers": "gcs"
+}
+
+gcs_catalog = gravitino_client.create_catalog(name="catalog",
+                                              type=Catalog.Type.FILESET,
+                                              provider="hadoop",
+                                              comment="This is a GCS fileset catalog",
+                                              properties=gcs_properties)
+
+```
+
+</TabItem>
+</Tabs>
+
+### Create an OSS fileset catalog
+
+<Tabs groupId="language" queryString>
+<TabItem value="shell" label="Shell">
+
+```shell
+curl -X POST -H "Accept: application/vnd.gravitino.v1+json" \
+-H "Content-Type: application/json" -d '{
+  "name": "catalog",
+  "type": "FILESET",
+  "comment": "comment",
+  "provider": "hadoop",
+  "properties": {
+    "location": "oss://bucket/root",
+    "oss-access-key-id": "access_key",
+    "oss-secret-access-key": "secret_key",
+    "oss-endpoint": "http://oss-cn-hangzhou.aliyuncs.com",
+    "filesystem-providers": "oss"
+  }
+}' http://localhost:8090/api/metalakes/metalake/catalogs
+```
+
+</TabItem>
+<TabItem value="java" label="Java">
+
+```java
+GravitinoClient gravitinoClient = GravitinoClient
+    .builder("http://localhost:8090")
+    .withMetalake("metalake")
+    .build();
+
+Map<String, String> ossProperties = ImmutableMap.<String, String>builder()
+    .put("location", "oss://bucket/root")
+    .put("oss-access-key-id", "access_key")
+    .put("oss-secret-access-key", "secret_key")
+    .put("oss-endpoint", "http://oss-cn-hangzhou.aliyuncs.com")
+    .put("filesystem-providers", "oss")
+    .build();
+
+Catalog ossCatalog = gravitinoClient.createCatalog("catalog",
+    Type.FILESET,
+    "hadoop", // provider, Gravitino only supports "hadoop" for now.
+    "This is an OSS fileset catalog",
+    ossProperties);
+// ...
+
+```
+
+</TabItem>
+<TabItem value="python" label="Python">
+
+```python
+gravitino_client: GravitinoClient = GravitinoClient(uri="http://localhost:8090", metalake_name="metalake")
+oss_properties = {
+    "location": "oss://bucket/root",
+    "oss-access-key-id": "access_key",
+    "oss-secret-access-key": "secret_key",
+    "oss-endpoint": "http://oss-cn-hangzhou.aliyuncs.com",
+    "filesystem-providers": "oss"
+}
+
+oss_catalog = gravitino_client.create_catalog(name="catalog",
+                                              type=Catalog.Type.FILESET,
+                                              provider="hadoop",
+                                              comment="This is an OSS fileset catalog",
+                                              properties=oss_properties)
+
+```
+
+</TabItem>
+</Tabs>
+
+### Create an ABS (Azure Blob Storage or ADLS) fileset catalog
+
+<Tabs groupId="language" queryString>
+<TabItem value="shell" label="Shell">
+
+```shell
+curl -X POST -H "Accept: application/vnd.gravitino.v1+json" \
+-H "Content-Type: application/json" -d '{
+  "name": "catalog",
+  "type": "FILESET",
+  "comment": "comment",
+  "provider": "hadoop",
+  "properties": {
+    "location": "abfss://container/root",
+    "abs-account-name": "The account name of the Azure Blob Storage",
+    "abs-account-key": "The account key of the Azure Blob Storage",
+    "filesystem-providers": "abs"
+  }
+}' http://localhost:8090/api/metalakes/metalake/catalogs
+```
+
+</TabItem>
+<TabItem value="java" label="Java">
+
+```java
+GravitinoClient gravitinoClient = GravitinoClient
+    .builder("http://localhost:8090")
+    .withMetalake("metalake")
+    .build();
+
+Map<String, String> absProperties = ImmutableMap.<String, String>builder()
+    .put("location", "abfss://container/root")
+    .put("abs-account-name", "The account name of the Azure Blob Storage")
+    .put("abs-account-key", "The account key of the Azure Blob Storage")
+    .put("filesystem-providers", "abs")
+    .build();
+
+Catalog absCatalog = gravitinoClient.createCatalog("catalog",
+    Type.FILESET,
+    "hadoop", // provider, Gravitino only supports "hadoop" for now.
+    "This is an Azure Blob Storage fileset catalog",
+    absProperties);
+// ...
+
+```
+
+</TabItem>
+<TabItem value="python" label="Python">
+
+```python
+gravitino_client: GravitinoClient = GravitinoClient(uri="http://localhost:8090", metalake_name="metalake")
+
+abs_properties = {
+    "location": "abfss://container/root",
+    "abs-account-name": "The account name of the Azure Blob Storage",
+    "abs-account-key": "The account key of the Azure Blob Storage",
+    "filesystem-providers": "abs"
+}
+
+abs_catalog = gravitino_client.create_catalog(name="catalog",
+                                              type=Catalog.Type.FILESET,
+                                              provider="hadoop",
+                                              comment="This is an Azure Blob Storage fileset catalog",
+                                              properties=abs_properties)
+
+```
+
+</TabItem>
+</Tabs>
+
+:::note
+- The prefix of an ABS (Azure Blob Storage or ADLS (v2)) location should always start with `abfss` NOT `abfs`, for instance, `abfss://container/root`. Value like `abfs://container/root` is not supported.
+- The prefix of an AWS S3 location should always start with `s3a` NOT `s3`, for instance, `s3a://bucket/root`. Value like `s3://bucket/root` is not supported.
+- The prefix of an Aliyun OSS location should always start with `oss`, for instance, `oss://bucket/root`.
+- The prefix of a GCS location should always start with `gs`, for instance, `gs://bucket/root`.
+:::
+
+
+## Create fileset schema
+
+This part is the same for all cloud storage filesets. You can create the schema with the following statements:
+
+<Tabs groupId="language" queryString>
+<TabItem value="shell" label="Shell">
+
+```shell
+curl -X POST -H "Accept: application/vnd.gravitino.v1+json" \
+-H "Content-Type: application/json" -d '{
+  "name": "schema",
+  "comment": "comment",
+  "properties": {
+    "location": "file:///tmp/root/schema"
+  }
+}' http://localhost:8090/api/metalakes/metalake/catalogs/catalog/schemas
+```
+
+</TabItem>
+<TabItem value="java" label="Java">
+
+```java
+GravitinoClient gravitinoClient = GravitinoClient
+    .builder("http://localhost:8090")
+    .withMetalake("metalake")
+    .build();
+
+// Assuming you have just created a Hadoop catalog named `catalog`
+Catalog catalog = gravitinoClient.loadCatalog("catalog");
+
+SupportsSchemas supportsSchemas = catalog.asSchemas();
+
+Map<String, String> schemaProperties = ImmutableMap.<String, String>builder()
+    // Property "location" is optional; if specified, every managed fileset created without
+    // an explicit storage location will be stored under this location.
+    .put("location", "file:///tmp/root/schema")
+    .build();
+Schema schema = supportsSchemas.createSchema("schema",
+    "This is a schema",
+    schemaProperties
+);
+// ...
+```
+
+</TabItem>
+</Tabs>
+
+You can change the value of the `location` property according to which catalog you are using. Moreover, if the `location` property is already set in the catalog, you can omit it in the schema.
+
+## Create filesets
+
+The following statements can be used to create a fileset in the schema:
+
+<Tabs groupId="language" queryString>
+<TabItem value="shell" label="Shell">
+
+```shell
+curl -X POST -H "Accept: application/vnd.gravitino.v1+json" \
+-H "Content-Type: application/json" -d '{
+  "name": "example_fileset",
+  "comment": "This is an example fileset",
+  "type": "MANAGED",
+  "storageLocation": "file:///tmp/root/schema/example_fileset",
+  "properties": {
+    "k1": "v1"
+  }
+}' http://localhost:8090/api/metalakes/metalake/catalogs/catalog/schemas/schema/filesets
+```
+
+</TabItem>
+<TabItem value="java" label="Java">
+
+```java
+GravitinoClient gravitinoClient = GravitinoClient
+    .builder("http://localhost:8090")
+    .withMetalake("metalake")
+    .build();
+
+Catalog catalog = gravitinoClient.loadCatalog("catalog");
+FilesetCatalog filesetCatalog = catalog.asFilesetCatalog();
+
+Map<String, String> propertiesMap = ImmutableMap.<String, String>builder()
+    .put("k1", "v1")
+    .build();
+
+filesetCatalog.createFileset(
+    NameIdentifier.of("schema", "example_fileset"),
+    "This is an example fileset",
+    Fileset.Type.MANAGED,
+    "file:///tmp/root/schema/example_fileset",
+    propertiesMap);
+```
+
+</TabItem>
+<TabItem value="python" label="Python">
+
+```python
+gravitino_client: GravitinoClient = GravitinoClient(uri="http://localhost:8090", metalake_name="metalake")
+
+catalog: Catalog = gravitino_client.load_catalog(name="catalog")
+catalog.as_fileset_catalog().create_fileset(ident=NameIdentifier.of("schema", "example_fileset"),
+                                            type=Fileset.Type.MANAGED,
+                                            comment="This is an example fileset",
+                                            storage_location="file:///tmp/root/schema/example_fileset",
+                                            properties={"k1": "v1"})
+```
+
+</TabItem>
+</Tabs>
+
+Similar to the schema, the `storageLocation` is optional if you have set the `location` property in the schema or catalog. Please change the value of
+`location` to the actual location where you want to store the fileset.
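For example, because `storageLocation` is optional for a `MANAGED` fileset once the schema (or catalog) `location` is set, a fileset can be created without specifying it at all. The request below is a minimal sketch that reuses the example metalake, catalog, and schema names from this guide; the fileset name `example_fileset_2` is only illustrative, and the storage directory is expected to be resolved under the schema location shown above:

```shell
# Create a managed fileset without "storageLocation"; Gravitino should place its
# data directory under the schema location (file:///tmp/root/schema in this guide).
curl -X POST -H "Accept: application/vnd.gravitino.v1+json" \
-H "Content-Type: application/json" -d '{
  "name": "example_fileset_2",
  "comment": "Managed fileset that inherits the schema location",
  "type": "MANAGED"
}' http://localhost:8090/api/metalakes/metalake/catalogs/catalog/schemas/schema/filesets
```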
+ + +## Using Spark to access the fileset + +The following code snippet shows how to use **Spark 3.1.3 with hadoop environment** to access the fileset + +```python +import logging +from gravitino import NameIdentifier, GravitinoClient, Catalog, Fileset, GravitinoAdminClient +from pyspark.sql import SparkSession +import os + +gravitino_url = "http://localhost:8090" +metalake_name = "test" + +catalog_name = "s3_catalog" +schema_name = "schema" +fileset_name = "example" + +## this is for S3 +os.environ["PYSPARK_SUBMIT_ARGS"] = "--jars /Users/yuqi/project/gravitino/bundles/aws-bundle/build/libs/gravitino-aws-bundle-0.8.0-incubating-SNAPSHOT.jar,/Users/yuqi/project/gravitino/clients/filesystem-hadoop3-runtime/build/libs/gravitino-filesystem-hadoop3-runtime-0.8.0-incubating-SNAPSHOT.jar,/Users/yuqi/Downloads/hadoop-jars/hadoop-aws-3.2.0.jar,/Users/yuqi/Downloads/aws-java-sdk-bundle-1.11.375.jar --master local[1] pyspark-shell" +spark = SparkSession.builder + .appName("s3_fielset_test") + .config("spark.hadoop.fs.AbstractFileSystem.gvfs.impl", "org.apache.gravitino.filesystem.hadoop.Gvfs") + .config("spark.hadoop.fs.gvfs.impl", "org.apache.gravitino.filesystem.hadoop.GravitinoVirtualFileSystem") + .config("spark.hadoop.fs.gravitino.server.uri", "http://localhost:8090") + .config("spark.hadoop.fs.gravitino.client.metalake", "test") + .config("spark.hadoop.s3-access-key-id", os.environ["S3_ACCESS_KEY_ID"]) + .config("spark.hadoop.s3-secret-access-key", os.environ["S3_SECRET_ACCESS_KEY"]) + .config("spark.hadoop.s3-endpoint", "http://s3.ap-northeast-1.amazonaws.com") + .config("spark.driver.memory", "2g") + .config("spark.driver.port", "2048") + .getOrCreate() + +### this is for GCS +os.environ["PYSPARK_SUBMIT_ARGS"] = "--jars /Users/yuqi/project/gravitino/bundles/gcp-bundle/build/libs/gravitino-gcp-bundle-0.8.0-incubating-SNAPSHOT.jar,/Users/yuqi/project/gravitino/clients/filesystem-hadoop3-runtime/build/libs/gravitino-filesystem-hadoop3-runtime-0.8.0-incubating-SNAPSHOT.jar,/Users/yuqi/Downloads/hadoop-jars/gcs-connector-hadoop3-2.2.22-shaded.jar --master local[1] pyspark-shell" +spark = SparkSession.builder + .appName("s3_fielset_test") + .config("spark.hadoop.fs.AbstractFileSystem.gvfs.impl", "org.apache.gravitino.filesystem.hadoop.Gvfs") + .config("spark.hadoop.fs.gvfs.impl", "org.apache.gravitino.filesystem.hadoop.GravitinoVirtualFileSystem") + .config("spark.hadoop.fs.gravitino.server.uri", "http://localhost:8090") + .config("spark.hadoop.fs.gravitino.client.metalake", "test") + .config("spark.hadoop.gcs-service-account-file", "/Users/yuqi/Downloads/silken-physics-431108-g3-30ab3d97bb60.json") + .config("spark.driver.memory", "2g") + .config("spark.driver.port", "2048") + .getOrCreate() + +### this is for OSS +os.environ["PYSPARK_SUBMIT_ARGS"] = "--jars /Users/yuqi/project/gravitino/bundles/aliyun-bundle/build/libs/gravitino-aliyun-bundle-0.8.0-incubating-SNAPSHOT.jar,/Users/yuqi/project/gravitino/clients/filesystem-hadoop3-runtime/build/libs/gravitino-filesystem-hadoop3-runtime-0.8.0-incubating-SNAPSHOT.jar,/Users/yuqi/Downloads/hadoop-jars/aliyun-sdk-oss-2.8.3.jar,/Users/yuqi/Downloads/hadoop-jars/hadoop-aliyun-3.2.0.jar,/Users/yuqi/Downloads/hadoop-jars/jdom-1.1.jar --master local[1] pyspark-shell" +spark = SparkSession.builder + .appName("s3_fielset_test") + .config("spark.hadoop.fs.AbstractFileSystem.gvfs.impl", "org.apache.gravitino.filesystem.hadoop.Gvfs") + .config("spark.hadoop.fs.gvfs.impl", "org.apache.gravitino.filesystem.hadoop.GravitinoVirtualFileSystem") + 
.config("spark.hadoop.fs.gravitino.server.uri", "http://localhost:8090") + .config("spark.hadoop.fs.gravitino.client.metalake", "test") + .config("spark.hadoop.oss-access-key-id", os.environ["OSS_ACCESS_KEY_ID"]) + .config("spark.hadoop.oss-secret-access-key", os.environ["S3_SECRET_ACCESS_KEY"]) + .config("spark.hadoop.oss-endpoint", "https://oss-cn-shanghai.aliyuncs.com") + .config("spark.driver.memory", "2g") + .config("spark.driver.port", "2048") + .getOrCreate() +spark.sparkContext.setLogLevel("DEBUG") + +### this is for ABS +os.environ["PYSPARK_SUBMIT_ARGS"] = "--jars /Users/yuqi/project/gravitino/bundles/azure-bundle/build/libs/gravitino-azure-bundle-0.8.0-incubating-SNAPSHOT.jar,/Users/yuqi/project/gravitino/clients/filesystem-hadoop3-runtime/build/libs/gravitino-filesystem-hadoop3-runtime-0.8.0-incubating-SNAPSHOT.jar,/Users/yuqi/Downloads/hadoop-jars/hadoop-azure-3.2.0.jar,/Users/yuqi/Downloads/hadoop-jars/azure-storage-7.0.0.jar,/Users/yuqi/Downloads/hadoop-jars/wildfly-openssl-1.0.4.Final.jar --master local[1] pyspark-shell" +spark = SparkSession.builder + .appName("s3_fielset_test") + .config("spark.hadoop.fs.AbstractFileSystem.gvfs.impl", "org.apache.gravitino.filesystem.hadoop.Gvfs") + .config("spark.hadoop.fs.gvfs.impl", "org.apache.gravitino.filesystem.hadoop.GravitinoVirtualFileSystem") + .config("spark.hadoop.fs.gravitino.server.uri", "http://localhost:8090") + .config("spark.hadoop.fs.gravitino.client.metalake", "test") + .config("spark.hadoop.abs-account-name", "xiaoyu456") + .config("spark.hadoop.abs-account-key", "account_key") + .config("spark.hadoop.fs.azure.skipUserGroupMetadataDuringInitialization", "true") + .config("spark.driver.memory", "2g") + .config("spark.driver.port", "2048") + .getOrCreate() + +data = [("Alice", 25), ("Bob", 30), ("Cathy", 45)] +columns = ["Name", "Age"] +spark_df = spark.createDataFrame(data, schema=columns) +gvfs_path = f"gvfs://fileset/{catalog_name}/{schema_name}/{fileset_name}/people" + +spark_df.coalesce(1).write + .mode("overwrite") + .option("header", "true") + .csv(gvfs_path) + +``` + +If your spark has no hadoop environment, you can use the following code snippet to access the fileset: + +```python +## replace the env PYSPARK_SUBMIT_ARGS variable in the code above with the following content: +### S3 +os.environ["PYSPARK_SUBMIT_ARGS"] = "--jars /Users/yuqi/project/gravitino/bundles/aws-hadoop-bundle/build/libs/gravitino-aws-hadoop-bundle-0.8.0-incubating-SNAPSHOT.jar,/Users/yuqi/project/gravitino/clients/filesystem-hadoop3-runtime/build/libs/gravitino-filesystem-hadoop3-runtime-0.8.0-incubating-SNAPSHOT.jar --master local[1] pyspark-shell" Review Comment: Can you avoid hardcode the gravitino version number here like `0.8.0-incubating-SNAPSHOT`? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@gravitino.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org