yuqi1129 commented on code in PR #6059:
URL: https://github.com/apache/gravitino/pull/6059#discussion_r1909932930

##########
docs/how-to-use-gvfs.md:
##########
@@ -42,7 +42,9 @@ the path mapping and convert automatically.
 
 ### Prerequisites
 
-+ A Hadoop environment with HDFS or other Hadoop Compatible File System (HCFS) implementations like S3, GCS, etc. GVFS has been tested against Hadoop 3.3.1. It is recommended to use Hadoop 3.3.1 or later, but it should work with Hadoop 2.x. Please create an [issue](https://www.github.com/apache/gravitino/issues) if you find any compatibility issues.
+  - GVFS has been tested against Hadoop 3.3.1. It is recommended to use Hadoop 3.3.1 or later, but it should work with Hadoop 2.x.
+    Please create an [issue](https://www.github.com/apache/gravitino/issues) if you find any
+    compatibility issues.
 
 ### Configuration

Review Comment:
   Configuration about Python GVFS is just in the chapter below; the whole structure was designed by jiebao.

##########
docs/hadoop-catalog-with-adls.md:
##########
@@ -0,0 +1,442 @@
+---
+title: "Hadoop catalog with ADLS"
+slug: /hadoop-catalog-with-adls
+date: 2025-01-03
+keyword: Hadoop catalog ADLS
+license: "This software is licensed under the Apache License version 2."
+---
+
+This document describes how to configure a Hadoop catalog with ADLS (Azure Blob Storage).
+
+## Prerequisites
+
+To set up a Hadoop catalog with ADLS, follow these steps:
+
+1. Download the [`gravitino-azure-bundle-${gravitino-version}.jar`](https://mvnrepository.com/artifact/org.apache.gravitino/gravitino-azure-bundle) file.
+2. Place the downloaded file into the Gravitino Hadoop catalog classpath at `${GRAVITINO_HOME}/catalogs/hadoop/libs/`.
+3. Start the Gravitino server by running the following command:
+
+```bash
+$ bin/gravitino-server.sh start
+```
+
+Once the server is up and running, you can proceed to configure the Hadoop catalog with ADLS.
+
+## Configurations for creating a Hadoop catalog with ADLS
+
+### Configuration for an ADLS Hadoop catalog
+
+Apart from the configurations mentioned in [Hadoop-catalog-catalog-configuration](./hadoop-catalog.md#catalog-properties), the following properties are required to configure a Hadoop catalog with ADLS:
+
+| Configuration item            | Description                                                                                                                                                                                                          | Default value   | Required | Since version    |
+|-------------------------------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|-----------------|----------|------------------|
+| `filesystem-providers`        | The file system providers to add. Set it to `abs` if it's an Azure Blob Storage fileset, or a comma-separated string that contains `abs`, like `oss,abs,s3`, to support multiple kinds of filesets including `abs`. | (none)          | Yes      | 0.8.0-incubating |
+| `default-filesystem-provider` | The name of the default filesystem provider of this Hadoop catalog, used if users do not specify the scheme in the URI. The default value is `builtin-local`. For Azure Blob Storage, if this value is set, the `abfss://` prefix can be omitted in the location. | `builtin-local` | No       | 0.8.0-incubating |
+| `azure-storage-account-name`  | The account name of Azure Blob Storage.                                                                                                                                                                              | (none)          | Yes      | 0.8.0-incubating |
+| `azure-storage-account-key`   | The account key of Azure Blob Storage.                                                                                                                                                                               | (none)          | Yes      | 0.8.0-incubating |
+
+### Configurations for a schema
+
+Refer to [Schema configurations](./hadoop-catalog.md#schema-properties) for more details.
+
+### Configurations for a fileset
+
+Refer to [Fileset configurations](./hadoop-catalog.md#fileset-properties) for more details.
+
+## Example of creating a Hadoop catalog with ADLS
+
+This section demonstrates how to create a Hadoop catalog with ADLS in Gravitino, with a complete example.
+
+### Step 1: Create a Hadoop catalog with ADLS
+
+First, you need to create a Hadoop catalog with ADLS. The following example shows how to create a Hadoop catalog with ADLS:
+
+<Tabs groupId="language" queryString>
+<TabItem value="shell" label="Shell">
+
+```shell
+curl -X POST -H "Accept: application/vnd.gravitino.v1+json" \
+-H "Content-Type: application/json" -d '{
+  "name": "example_catalog",
+  "type": "FILESET",
+  "comment": "This is an ADLS fileset catalog",
+  "provider": "hadoop",
+  "properties": {
+    "location": "abfss://contai...@account-name.dfs.core.windows.net/path",
+    "azure-storage-account-name": "The account name of the Azure Blob Storage",
+    "azure-storage-account-key": "The account key of the Azure Blob Storage",
+    "filesystem-providers": "abs"
+  }
+}' ${GRAVITINO_SERVER_IP:PORT}/api/metalakes/metalake/catalogs
+```
+
+</TabItem>
+<TabItem value="java" label="Java">
+
+```java
+GravitinoClient gravitinoClient = GravitinoClient
+    .builder("${GRAVITINO_SERVER_IP:PORT}")
+    .withMetalake("metalake")
+    .build();
+
+Map<String, String> adlsProperties = ImmutableMap.<String, String>builder()
+    .put("location", "abfss://contai...@account-name.dfs.core.windows.net/path")
+    .put("azure-storage-account-name", "azure storage account name")
+    .put("azure-storage-account-key", "azure storage account key")
+    .put("filesystem-providers", "abs")
+    .build();
+
+Catalog adlsCatalog = gravitinoClient.createCatalog("example_catalog",
+    Type.FILESET,
+    "hadoop", // provider, Gravitino only supports "hadoop" for now.
+    "This is an ADLS fileset catalog",
+    adlsProperties);
+// ...
+```
+
+</TabItem>
+<TabItem value="python" label="Python">
+
+```python
+gravitino_client: GravitinoClient = GravitinoClient(uri="${GRAVITINO_SERVER_IP:PORT}", metalake_name="metalake")
+adls_properties = {
+    "location": "abfss://contai...@account-name.dfs.core.windows.net/path",
+    "azure-storage-account-name": "azure storage account name",
+    "azure-storage-account-key": "azure storage account key",
+    "filesystem-providers": "abs"
+}
+
+adls_catalog = gravitino_client.create_catalog(name="example_catalog",
+                                               type=Catalog.Type.FILESET,
+                                               provider="hadoop",
+                                               comment="This is an ADLS fileset catalog",
+                                               properties=adls_properties)
+```
+
+</TabItem>
+</Tabs>
+
+### Step 2: Create a schema
+
+Once the catalog is created, you can create a schema.
+The following example shows how to create a schema:
+
+<Tabs groupId="language" queryString>
+<TabItem value="shell" label="Shell">
+
+```shell
+curl -X POST -H "Accept: application/vnd.gravitino.v1+json" \
+-H "Content-Type: application/json" -d '{
+  "name": "test_schema",
+  "comment": "This is an ADLS schema",
+  "properties": {
+    "location": "abfss://contai...@account-name.dfs.core.windows.net/path"
+  }
+}' ${GRAVITINO_SERVER_IP:PORT}/api/metalakes/metalake/catalogs/test_catalog/schemas
+```
+
+</TabItem>
+<TabItem value="java" label="Java">
+
+```java
+Catalog catalog = gravitinoClient.loadCatalog("test_catalog");
+
+SupportsSchemas supportsSchemas = catalog.asSchemas();
+
+Map<String, String> schemaProperties = ImmutableMap.<String, String>builder()
+    .put("location", "abfss://contai...@account-name.dfs.core.windows.net/path")
+    .build();
+Schema schema = supportsSchemas.createSchema("test_schema",
+    "This is an ADLS schema",
+    schemaProperties
+);
+// ...
+```
+
+</TabItem>
+<TabItem value="python" label="Python">
+
+```python
+gravitino_client: GravitinoClient = GravitinoClient(uri="http://127.0.0.1:8090", metalake_name="metalake")
+catalog: Catalog = gravitino_client.load_catalog(name="test_catalog")
+catalog.as_schemas().create_schema(name="test_schema",
+                                   comment="This is an ADLS schema",
+                                   properties={"location": "abfss://contai...@account-name.dfs.core.windows.net/path"})
+```
+
+</TabItem>
+</Tabs>
+
+### Step 3: Create a fileset
+
+After creating the schema, you can create a fileset. The following example shows how to create a fileset:
+
+<Tabs groupId="language" queryString>
+<TabItem value="shell" label="Shell">
+
+```shell
+curl -X POST -H "Accept: application/vnd.gravitino.v1+json" \
+-H "Content-Type: application/json" -d '{
+  "name": "example_fileset",
+  "comment": "This is an example fileset",
+  "type": "MANAGED",
+  "storageLocation": "abfss://contai...@account-name.dfs.core.windows.net/path/example_fileset",
+  "properties": {
+    "k1": "v1"
+  }
+}' ${GRAVITINO_SERVER_IP:PORT}/api/metalakes/metalake/catalogs/test_catalog/schemas/test_schema/filesets
+```
+
+</TabItem>
+<TabItem value="java" label="Java">
+
+```java
+GravitinoClient gravitinoClient = GravitinoClient
+    .builder("${GRAVITINO_SERVER_IP:PORT}")
+    .withMetalake("metalake")
+    .build();
+
+Catalog catalog = gravitinoClient.loadCatalog("test_catalog");
+FilesetCatalog filesetCatalog = catalog.asFilesetCatalog();
+
+Map<String, String> propertiesMap = ImmutableMap.<String, String>builder()
+    .put("k1", "v1")
+    .build();
+
+filesetCatalog.createFileset(
+    NameIdentifier.of("test_schema", "example_fileset"),
+    "This is an example fileset",
+    Fileset.Type.MANAGED,
+    "abfss://contai...@account-name.dfs.core.windows.net/path/example_fileset",
+    propertiesMap);
+```
+
+</TabItem>
+<TabItem value="python" label="Python">
+
+```python
+gravitino_client: GravitinoClient = GravitinoClient(uri="${GRAVITINO_SERVER_IP:PORT}", metalake_name="metalake")
+
+catalog: Catalog = gravitino_client.load_catalog(name="test_catalog")
+catalog.as_fileset_catalog().create_fileset(ident=NameIdentifier.of("test_schema", "example_fileset"),
+                                            type=Fileset.Type.MANAGED,
+                                            comment="This is an example fileset",
+                                            storage_location="abfss://contai...@account-name.dfs.core.windows.net/path/example_fileset",
+                                            properties={"k1": "v1"})
+```
+
+</TabItem>
+</Tabs>
+
+## Accessing a fileset with ADLS
+
+### Using Spark to access the fileset
+
+The following code snippet shows how to use **PySpark 3.1.3 with a Hadoop environment (Hadoop 3.2.0)** to access the fileset:
+
+```python
+import logging
+from gravitino import NameIdentifier, GravitinoClient, Catalog, Fileset, GravitinoAdminClient
+from pyspark.sql import SparkSession
+import os
+
+gravitino_url = "${GRAVITINO_SERVER_IP:PORT}"
+metalake_name = "test"
+
+catalog_name = "your_adls_catalog"
+schema_name = "your_adls_schema"
+fileset_name = "your_adls_fileset"
+
+os.environ["PYSPARK_SUBMIT_ARGS"] = "--jars /path/to/gravitino-azure-{gravitino-version}.jar,/path/to/gravitino-filesystem-hadoop3-runtime-{gravitino-version}.jar,/path/to/hadoop-azure-3.2.0.jar,/path/to/azure-storage-7.0.0.jar,/path/to/wildfly-openssl-1.0.4.Final.jar --master local[1] pyspark-shell"
+spark = SparkSession.builder \
+    .appName("adls_fileset_test") \
+    .config("spark.hadoop.fs.AbstractFileSystem.gvfs.impl", "org.apache.gravitino.filesystem.hadoop.Gvfs") \
+    .config("spark.hadoop.fs.gvfs.impl", "org.apache.gravitino.filesystem.hadoop.GravitinoVirtualFileSystem") \
+    .config("spark.hadoop.fs.gravitino.server.uri", "${GRAVITINO_SERVER_URL}") \
+    .config("spark.hadoop.fs.gravitino.client.metalake", "test") \
+    .config("spark.hadoop.azure-storage-account-name", "azure_account_name") \
+    .config("spark.hadoop.azure-storage-account-key", "azure_account_key") \
+    .config("spark.hadoop.fs.azure.skipUserGroupMetadataDuringInitialization", "true") \
+    .config("spark.driver.memory", "2g") \
+    .config("spark.driver.port", "2048") \
+    .getOrCreate()
+
+data = [("Alice", 25), ("Bob", 30), ("Cathy", 45)]
+columns = ["Name", "Age"]
+spark_df = spark.createDataFrame(data, schema=columns)
+gvfs_path = f"gvfs://fileset/{catalog_name}/{schema_name}/{fileset_name}/people"
+
+spark_df.coalesce(1).write \
+    .mode("overwrite") \
+    .option("header", "true") \
+    .csv(gvfs_path)
+```
+
+If your Spark is **without a Hadoop environment**, you can use the following code snippet to access the fileset:
+
+```python
+# Replace the PYSPARK_SUBMIT_ARGS setting in the snippet above with the following; the other settings stay the same.
+
+os.environ["PYSPARK_SUBMIT_ARGS"] = "--jars /path/to/gravitino-azure-bundle-{gravitino-version}.jar,/path/to/gravitino-filesystem-hadoop3-runtime-{gravitino-version}.jar --master local[1] pyspark-shell"
+```
+
+- [`gravitino-azure-bundle-${gravitino-version}.jar`](https://mvnrepository.com/artifact/org.apache.gravitino/gravitino-azure-bundle) is the Gravitino ADLS jar with the Hadoop environment and the `hadoop-azure` jar.
+- [`gravitino-azure-${gravitino-version}.jar`](https://mvnrepository.com/artifact/org.apache.gravitino/gravitino-azure) is a condensed version of the Gravitino ADLS bundle jar without the Hadoop environment and the `hadoop-azure` jar.
+- `hadoop-azure-3.2.0.jar` and `azure-storage-7.0.0.jar` can be found in the Hadoop distribution in the `${HADOOP_HOME}/share/hadoop/tools/lib` directory.
+
+Please choose the correct jar according to your environment.
+
+:::note
+In some Spark versions, the driver needs a Hadoop environment, so adding the bundle jars with `--jars` may not work. If this is the case, you should add the jars to the Spark CLASSPATH directly.
+:::
+
+### Using Gravitino virtual file system Java client to access the fileset
+
+```java

Review Comment:
   added
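The Java snippet itself is cut off in the hunk above. As a rough orientation only, here is a minimal sketch, not the PR's actual snippet, of accessing the fileset through the GVFS implementation of the standard Hadoop `FileSystem` API. It reuses the configuration keys from the Spark example above (without the `spark.hadoop.` prefix); the server URI, metalake, account credentials, and fileset path are placeholders.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class GvfsAdlsExample {
  public static void main(String[] args) throws Exception {
    // Same keys as the Spark example, minus the "spark.hadoop." prefix.
    Configuration conf = new Configuration();
    conf.set("fs.AbstractFileSystem.gvfs.impl", "org.apache.gravitino.filesystem.hadoop.Gvfs");
    conf.set("fs.gvfs.impl", "org.apache.gravitino.filesystem.hadoop.GravitinoVirtualFileSystem");
    conf.set("fs.gravitino.server.uri", "${GRAVITINO_SERVER_IP:PORT}");
    conf.set("fs.gravitino.client.metalake", "test");
    conf.set("azure-storage-account-name", "azure_account_name");
    conf.set("azure-storage-account-key", "azure_account_key");

    // A gvfs:// path maps to the fileset's actual storage location on ADLS.
    Path filesetPath = new Path("gvfs://fileset/your_adls_catalog/your_adls_schema/your_adls_fileset/people");
    try (FileSystem fs = filesetPath.getFileSystem(conf)) {
      fs.mkdirs(filesetPath);
      for (FileStatus status : fs.listStatus(filesetPath)) {
        System.out.println(status.getPath());
      }
    }
  }
}
```

As with the Spark example, the Gravitino filesystem Hadoop runtime jar and the Azure jars listed above would need to be on the classpath.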

##########
docs/hadoop-catalog.md:
##########
@@ -9,9 +9,9 @@ license: "This software is licensed under the Apache License version 2."
 
 ## Introduction
 
 Hadoop catalog is a fileset catalog that using Hadoop Compatible File System (HCFS) to manage
-the storage location of the fileset. Currently, it supports local filesystem and HDFS. For
-object storage like S3, GCS, Azure Blob Storage and OSS, you can put the hadoop object store jar like
-`gravitino-aws-bundle-{gravitino-version}.jar` into the `$GRAVITINO_HOME/catalogs/hadoop/libs` directory to enable the support.
+the storage location of the fileset. Currently, it supports the local filesystem and HDFS. Since 0.7.0-incubating, Gravitino supports [S3](hadoop-catalog-with-S3.md), [GCS](hadoop-catalog-with-gcs.md), [OSS](hadoop-catalog-with-oss.md) and [Azure Blob Storage](hadoop-catalog-with-adls.md) through the Hadoop catalog.
+
+The rest of this document will use HDFS or the local file system as an example to illustrate how to use the Hadoop catalog. For S3, GCS, OSS and Azure Blob Storage, the configuration is similar to HDFS; please refer to the corresponding document for more details.

Review Comment:
   Removed cloud storage configuration as suggested.
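For readers comparing this with the cloud-storage documents referenced above, here is a minimal sketch, not part of this PR, of creating a plain Hadoop catalog backed by HDFS. It follows the same abbreviated Java client style as the ADLS examples earlier in this review (imports omitted, as in those snippets); the namenode address, path, and catalog name are illustrative placeholders.

```java
// Illustrative sketch only: a Hadoop catalog whose default location points at HDFS.
// Unlike the ADLS case, no cloud-storage bundle jar or "filesystem-providers" entry is involved here.
GravitinoClient gravitinoClient = GravitinoClient
    .builder("${GRAVITINO_SERVER_IP:PORT}")
    .withMetalake("metalake")
    .build();

Map<String, String> hdfsProperties = ImmutableMap.<String, String>builder()
    .put("location", "hdfs://namenode:9000/user/hadoop/fileset_catalog")  // placeholder HDFS path
    .build();

Catalog hdfsCatalog = gravitinoClient.createCatalog("hdfs_catalog",
    Type.FILESET,
    "hadoop",  // provider, as in the ADLS example
    "This is an HDFS fileset catalog",
    hdfsProperties);
```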