Re: [PR] [#5472] improvement(docs): Add example to use cloud storage fileset and polish hadoop-catalog document. [gravitino]

via GitHub Mon, 13 Jan 2025 03:18:49 -0800


jerryshao commented on code in PR #6059:
URL: https://github.com/apache/gravitino/pull/6059#discussion_r1913032683



##########
docs/hadoop-catalog-with-adls.md:
##########
@@ -0,0 +1,522 @@
+---
+title: "Hadoop catalog with ADLS"
+slug: /hadoop-catalog-with-adls
+date: 2025-01-03
+keyword: Hadoop catalog ADLS
+license: "This software is licensed under the Apache License version 2."
+---
+
+This document describes how to configure a Hadoop catalog with ADLS (Azure 
Blob Storage).
+
+## Prerequisites
+
+To set up a Hadoop catalog with ADLS, follow these steps:
+
+1. Download the 
[`gravitino-azure-bundle-${gravitino-version}.jar`](https://mvnrepository.com/artifact/org.apache.gravitino/gravitino-azure-bundle)
 file.
+2. Place the downloaded file into the Gravitino Hadoop catalog classpath at 
`${GRAVITINO_HOME}/catalogs/hadoop/libs/`.
+3. Start the Gravitino server by running the following command:
+
+```bash
+$ bin/gravitino-server.sh start
+```
+Once the server is up and running, you can proceed to configure the Hadoop 
catalog with ADLS. In the rest of this document we will use 
`http://localhost:8090` as the Gravitino server URL, please replace it with 
your actual server URL.
+
+## Configurations for creating a Hadoop catalog with ADLS
+
+### Configuration for a ADLS Hadoop catalog
+
+Apart from configurations mentioned in 
[Hadoop-catalog-catalog-configuration](./hadoop-catalog.md#catalog-properties), 
the following properties are required to configure a Hadoop catalog with ADLS:
+
+| Configuration item                | Description                              
                                                                                
                                                                                
                                      | Default value   | Required | Since 
version    |
+|-----------------------------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|-----------------|----------|------------------|
+| `filesystem-providers`            | The file system providers to add. Set it 
to `abs` if it's a Azure Blob Storage fileset, or a comma separated string that 
contains `abs` like `oss,abs,s3` to support multiple kinds of fileset including 
`abs`.                                | (none)          | Yes      | 
0.8.0-incubating |
+| `default-filesystem-provider`     | The name default filesystem providers of 
this Hadoop catalog if users do not specify the scheme in the URI. Default 
value is `builtin-local`, for Azure Blob Storage, if we set this value, we can 
omit the prefix 'abfss://' in the location. | `builtin-local` | No       | 
0.8.0-incubating |
+| `azure-storage-account-name `     | The account name of Azure Blob Storage.  
                                                                                
                                                                                
                                      | (none)          | Yes      | 
0.8.0-incubating |
+| `azure-storage-account-key`       | The account key of Azure Blob Storage.   
                                                                                
                                                                                
                                      | (none)          | Yes      | 
0.8.0-incubating |
+
+### Configurations for a schema
+
+Refer to [Schema configurations](./hadoop-catalog.md#schema-properties) for 
more details.
+
+### Configurations for a fileset
+
+Refer to [Fileset configurations](./hadoop-catalog.md#fileset-properties) for 
more details.
+
+## Example of creating Hadoop catalog with ADLS
+
+This section demonstrates how to create the Hadoop catalog with ADLS in 
Gravitino, with a complete example.
+
+### Step1: Create a Hadoop catalog with ADLS
+
+First, you need to create a Hadoop catalog with ADLS. The following example 
shows how to create a Hadoop catalog with ADLS:
+
+<Tabs groupId="language" queryString>
+<TabItem value="shell" label="Shell">
+
+```shell
+curl -X POST -H "Accept: application/vnd.gravitino.v1+json" \
+-H "Content-Type: application/json" -d '{
+  "name": "example_catalog",
+  "type": "FILESET",
+  "comment": "This is a ADLS fileset catalog",
+  "provider": "hadoop",
+  "properties": {
+    "location": "abfss://contai...@account-name.dfs.core.windows.net/path",
+    "azure-storage-account-name": "The account name of the Azure Blob Storage",
+    "azure-storage-account-key": "The account key of the Azure Blob Storage",
+    "filesystem-providers": "abs"
+  }
+}' http://localhost:8090/api/metalakes/metalake/catalogs
+```
+
+</TabItem>
+<TabItem value="java" label="Java">
+
+```java
+GravitinoClient gravitinoClient = GravitinoClient
+    .builder("http://localhost:8090";)
+    .withMetalake("metalake")
+    .build();
+
+adlsProperties = ImmutableMap.<String, String>builder()
+    .put("location", 
"abfss://contai...@account-name.dfs.core.windows.net/path")
+    .put("azure-storage-account-name", "azure storage account name")
+    .put("azure-storage-account-key", "azure storage account key")
+    .put("filesystem-providers", "abs")
+    .build();
+
+Catalog adlsCatalog = gravitinoClient.createCatalog("example_catalog",
+    Type.FILESET,
+    "hadoop", // provider, Gravitino only supports "hadoop" for now.
+    "This is a ADLS fileset catalog",
+    adlsProperties);
+// ...
+
+```
+
+</TabItem>
+<TabItem value="python" label="Python">
+
+```python
+gravitino_client: GravitinoClient = 
GravitinoClient(uri="http://localhost:8090";, metalake_name="metalake")
+adls_properties = {
+    "location": "abfss://contai...@account-name.dfs.core.windows.net/path",
+    "azure-storage-account-name": "azure storage account name",
+    "azure-storage-account-key": "azure storage account key",
+    "filesystem-providers": "abs"
+}
+
+adls_properties = gravitino_client.create_catalog(name="example_catalog",
+                                             type=Catalog.Type.FILESET,
+                                             provider="hadoop",
+                                             comment="This is a ADLS fileset 
catalog",
+                                             properties=adls_properties)
+
+```
+
+</TabItem>
+</Tabs>
+
+### Step2: Create a schema
+
+Once the catalog is created, you can create a schema. The following example 
shows how to create a schema:
+
+<Tabs groupId="language" queryString>
+<TabItem value="shell" label="Shell">
+
+```shell
+curl -X POST -H "Accept: application/vnd.gravitino.v1+json" \
+-H "Content-Type: application/json" -d '{
+  "name": "test_schema",
+  "comment": "This is a ADLS schema",
+  "properties": {
+    "location": "abfss://contai...@account-name.dfs.core.windows.net/path"
+  }
+}' http://localhost:8090/api/metalakes/metalake/catalogs/test_catalog/schemas
+```
+
+</TabItem>
+<TabItem value="java" label="Java">
+
+```java
+Catalog catalog = gravitinoClient.loadCatalog("test_catalog");
+
+SupportsSchemas supportsSchemas = catalog.asSchemas();
+
+Map<String, String> schemaProperties = ImmutableMap.<String, String>builder()
+    .put("location", 
"abfss://contai...@account-name.dfs.core.windows.net/path")
+    .build();
+Schema schema = supportsSchemas.createSchema("test_schema",
+    "This is a ADLS schema",
+    schemaProperties
+);
+// ...
+```
+
+</TabItem>
+<TabItem value="python" label="Python">
+
+```python
+gravitino_client: GravitinoClient = 
GravitinoClient(uri="http://localhost:8090";, metalake_name="metalake")
+catalog: Catalog = gravitino_client.load_catalog(name="test_catalog")
+catalog.as_schemas().create_schema(name="test_schema",
+                                   comment="This is a ADLS schema",
+                                   properties={"location": 
"abfss://contai...@account-name.dfs.core.windows.net/path"})
+```
+
+</TabItem>
+</Tabs>
+
+### Step3: Create a fileset
+
+After creating the schema, you can create a fileset. The following example 
shows how to create a fileset:
+
+<Tabs groupId="language" queryString>
+<TabItem value="shell" label="Shell">
+
+```shell
+curl -X POST -H "Accept: application/vnd.gravitino.v1+json" \
+-H "Content-Type: application/json" -d '{
+  "name": "example_fileset",
+  "comment": "This is an example fileset",
+  "type": "MANAGED",
+  "storageLocation": 
"abfss://contai...@account-name.dfs.core.windows.net/path/example_fileset",
+  "properties": {
+    "k1": "v1"
+  }
+}' 
http://localhost:8090/api/metalakes/metalake/catalogs/test_catalog/schemas/test_schema/filesets
+```
+
+</TabItem>
+<TabItem value="java" label="Java">
+
+```java
+GravitinoClient gravitinoClient = GravitinoClient
+    .builder("http://localhost:8090";)
+    .withMetalake("metalake")
+    .build();
+
+Catalog catalog = gravitinoClient.loadCatalog("test_catalog");
+FilesetCatalog filesetCatalog = catalog.asFilesetCatalog();
+
+Map<String, String> propertiesMap = ImmutableMap.<String, String>builder()
+        .put("k1", "v1")
+        .build();
+
+filesetCatalog.createFileset(
+  NameIdentifier.of("test_schema", "example_fileset"),
+  "This is an example fileset",
+  Fileset.Type.MANAGED,
+  "abfss://contai...@account-name.dfs.core.windows.net/path/example_fileset",
+  propertiesMap,
+);
+```
+
+</TabItem>
+<TabItem value="python" label="Python">
+
+```python
+gravitino_client: GravitinoClient = 
GravitinoClient(uri="http://localhost:8090";, metalake_name="metalake")
+
+catalog: Catalog = gravitino_client.load_catalog(name="test_catalog")
+catalog.as_fileset_catalog().create_fileset(ident=NameIdentifier.of("test_schema",
 "example_fileset"),
+                                            type=Fileset.Type.MANAGED,
+                                            comment="This is an example 
fileset",
+                                            
storage_location="abfss://contai...@account-name.dfs.core.windows.net/path/example_fileset",
+                                            properties={"k1": "v1"})
+```
+
+</TabItem>
+</Tabs>
+
+## Accessing a fileset with ADLS
+
+### Using the GVFS Java client to access the fileset
+
+To access fileset with Azure Blob Storage(ADLS) using the GVFS Java client, 
based on the [basic GVFS configurations](./how-to-use-gvfs.md#configuration-1), 
you need to add the following configurations:
+
+| Configuration item           | Description                             | 
Default value | Required | Since version    |
+|------------------------------|-----------------------------------------|---------------|----------|------------------|
+| `azure-storage-account-name` | The account name of Azure Blob Storage. | 
(none)        | Yes      | 0.8.0-incubating |
+| `azure-storage-account-key`  | The account key of Azure Blob Storage.  | 
(none)        | Yes      | 0.8.0-incubating |
+
+:::note
+If the catalog has enabled [credential 
vending](security/credential-vending.md), the properties above can be omitted.
+:::
+
+```java
+Configuration conf = new Configuration();
+conf.set("fs.AbstractFileSystem.gvfs.impl","org.apache.gravitino.filesystem.hadoop.Gvfs");
+conf.set("fs.gvfs.impl","org.apache.gravitino.filesystem.hadoop.GravitinoVirtualFileSystem");
+conf.set("fs.gravitino.server.uri","http://localhost:8090";);
+conf.set("fs.gravitino.client.metalake","test_metalake");
+conf.set("azure-storage-account-name", "account_name_of_adls");
+conf.set("azure-storage-account-key", "account_key_of_adls");
+Path filesetPath = new 
Path("gvfs://fileset/test_catalog/test_schema/test_fileset/new_dir");
+FileSystem fs = filesetPath.getFileSystem(conf);
+fs.mkdirs(filesetPath);
+...
+```
+
+Similar to Spark configurations, you need to add ADLS bundle jars to the 
classpath according to your environment.
+
+If your wants to custom your hadoop version or there is already a hadoop 
version in your project, you can add the following dependencies to your 
`pom.xml`:
+
+```xml
+  <dependency>
+    <groupId>org.apache.hadoop</groupId>
+    <artifactId>hadoop-common</artifactId>
+    <version>${HADOOP_VERSION}</version>
+  </dependency>
+
+  <dependency>
+    <groupId>org.apache.hadoop</groupId>
+    <artifactId>hadoop-azure</artifactId>
+    <version>${HADOOP_VERSION}</version>
+  </dependency>
+
+  <dependency>
+    <groupId>org.apache.gravitino</groupId>
+    <artifactId>filesystem-hadoop3-runtime</artifactId>
+    <version>0.8.0-incubating-SNAPSHOT</version>
+  </dependency>
+
+  <dependency>
+    <groupId>org.apache.gravitino</groupId>
+    <artifactId>gravitino-azure</artifactId>
+    <version>0.8.0-incubating-SNAPSHOT</version>
+  </dependency>
+```
+
+Or use the bundle jar with Hadoop environment:
+
+```xml
+  <dependency>
+    <groupId>org.apache.gravitino</groupId>
+    <artifactId>gravitino-azure-bundle</artifactId>
+    <version>0.8.0-incubating-SNAPSHOT</version>
+  </dependency>
+
+  <dependency>
+    <groupId>org.apache.gravitino</groupId>
+    <artifactId>filesystem-hadoop3-runtime</artifactId>
+    <version>0.8.0-incubating-SNAPSHOT</version>
+  </dependency>
+```
+
+### Using Spark to access the fileset
+
+The following code snippet shows how to use **PySpark 3.1.3 with Hadoop 
environment(Hadoop 3.2.0)** to access the fileset:
+
+Before running the following code, you need to install required packages:
+
+```bash
+pip install pyspark==3.1.3
+pip install gravitino==0.8.0-incubating
+```
+Then you can run the following code:
+
+```python
+import logging
+from gravitino import NameIdentifier, GravitinoClient, Catalog, Fileset, 
GravitinoAdminClient
+from pyspark.sql import SparkSession
+import os
+
+gravitino_url = "http://localhost:8090";
+metalake_name = "test"
+
+catalog_name = "your_adls_catalog"
+schema_name = "your_adls_schema"
+fileset_name = "your_adls_fileset"
+
+os.environ["PYSPARK_SUBMIT_ARGS"] = "--jars 
/path/to/gravitino-azure-{gravitino-version}.jar,/path/to/gravitino-filesystem-hadoop3-runtime-{gravitino-version}.jar,/path/to/hadoop-azure-3.2.0.jar,/path/to/azure-storage-7.0.0.jar,/path/to/wildfly-openssl-1.0.4.Final.jar
 --master local[1] pyspark-shell"
+spark = SparkSession.builder
+.appName("adls_fileset_test")
+.config("spark.hadoop.fs.AbstractFileSystem.gvfs.impl", 
"org.apache.gravitino.filesystem.hadoop.Gvfs")
+.config("spark.hadoop.fs.gvfs.impl", 
"org.apache.gravitino.filesystem.hadoop.GravitinoVirtualFileSystem")
+.config("spark.hadoop.fs.gravitino.server.uri", "http://localhost:8090";)
+.config("spark.hadoop.fs.gravitino.client.metalake", "test")
+.config("spark.hadoop.azure-storage-account-name", "azure_account_name")
+.config("spark.hadoop.azure-storage-account-key", "azure_account_name")
+.config("spark.hadoop.fs.azure.skipUserGroupMetadataDuringInitialization", 
"true")
+.config("spark.driver.memory", "2g")
+.config("spark.driver.port", "2048")
+.getOrCreate()
+
+data = [("Alice", 25), ("Bob", 30), ("Cathy", 45)]
+columns = ["Name", "Age"]
+spark_df = spark.createDataFrame(data, schema=columns)
+gvfs_path = 
f"gvfs://fileset/{catalog_name}/{schema_name}/{fileset_name}/people"
+
+spark_df.coalesce(1).write
+.mode("overwrite")
+.option("header", "true")
+.csv(gvfs_path)
+```
+
+If your Spark **without Hadoop environment**, you can use the following code 
snippet to access the fileset:
+
+```python
+## Replace the following code snippet with the above code snippet with the 
same environment variables
+
+os.environ["PYSPARK_SUBMIT_ARGS"] = "--jars 
/path/to/gravitino-azure-bundle-{gravitino-version}.jar,/path/to/gravitino-filesystem-hadoop3-runtime-{gravitino-version}.jar
 --master local[1] pyspark-shell"
+```
+
+- 
[`gravitino-azure-bundle-${gravitino-version}.jar`](https://mvnrepository.com/artifact/org.apache.gravitino/gravitino-azure-bundle)
 is the Gravitino ADLS jar with Hadoop environment(3.3.1) and `hadoop-azure` 
jar.
+- 
[`gravitino-azure-${gravitino-version}.jar`](https://mvnrepository.com/artifact/org.apache.gravitino/gravitino-azure)
 is a condensed version of the Gravitino ADLS bundle jar without Hadoop 
environment and `hadoop-azure` jar.
+- `hadoop-azure-3.2.0.jar` and `azure-storage-7.0.0.jar` can be found in the 
Hadoop distribution in the `${HADOOP_HOME}/share/hadoop/tools/lib` directory.
+
+
+Please choose the correct jar according to your environment.
+
+:::note
+In some Spark versions, a Hadoop environment is necessary for the driver, 
adding the bundle jars with '--jars' may not work. If this is the case, you 
should add the jars to the spark CLASSPATH directly.
+:::
+
+### Accessing a fileset using the Hadoop fs command
+
+The following are examples of how to use the `hadoop fs` command to access the 
fileset in Hadoop 3.1.3.
+
+1. Adding the following contents to the 
`${HADOOP_HOME}/etc/hadoop/core-site.xml` file:
+
+```xml
+  <property>
+    <name>fs.AbstractFileSystem.gvfs.impl</name>
+    <value>org.apache.gravitino.filesystem.hadoop.Gvfs</value>
+  </property>
+
+  <property>
+    <name>fs.gvfs.impl</name>
+    
<value>org.apache.gravitino.filesystem.hadoop.GravitinoVirtualFileSystem</value>
+  </property>
+
+  <property>
+    <name>fs.gravitino.server.uri</name>
+    <value>http://localhost:8090</value>
+  </property>
+
+  <property>
+    <name>fs.gravitino.client.metalake</name>
+    <value>test</value>
+  </property>
+
+  <property>
+    <name>azure-storage-account-name</name>
+    <value>account_name</value>
+  </property>
+  <property>
+    <name>azure-storage-account-key</name>
+    <value>account_key</value>
+  </property>
+```
+
+2. Add the necessary jars to the Hadoop classpath.
+
+For ADLS, you need to add 
`gravitino-filesystem-hadoop3-runtime-${gravitino-version}.jar`, 
`gravitino-azure-${gravitino-version}.jar` and 
`hadoop-azure-${hadoop-version}.jar` located at 
`${HADOOP_HOME}/share/hadoop/tools/lib/` to the Hadoop classpath. 
+
+3. Run the following command to access the fileset:
+
+```shell
+./${HADOOP_HOME}/bin/hadoop dfs -ls 
gvfs://fileset/adls_catalog/adls_schema/adls_fileset
+./${HADOOP_HOME}/bin/hadoop dfs -put /path/to/local/file 
gvfs://fileset/adls_catalog/adls_schema/adls_fileset
+```
+
+### Using the GVFS Python client to access a fileset
+
+In order to access fileset with Azure Blob storage (ADLS) using the GVFS 
Python client, apart from [basic GVFS 
configurations](./how-to-use-gvfs.md#configuration-1), you need to add the 
following configurations:
+
+| Configuration item | Description                            | Default value 
| Required | Since version    |
+|--------------------|----------------------------------------|---------------|----------|------------------|
+| `abs_account_name` | The account name of Azure Blob Storage | (none)        
| Yes      | 0.8.0-incubating |
+| `abs_account_key`  | The account key of Azure Blob Storage  | (none)        
| Yes      | 0.8.0-incubating |
+
+:::
+If the catalog has enabled [credential 
vending](security/credential-vending.md), the properties above can be omitted.
+:::
+
+Please install the `gravitino` package before running the following code:
+
+```bash
+pip install gravitino==0.8.0-incubating

Review Comment:
   It is the `apache-gravitino`, not gravitino, also please remove all the 
hardcoded version number in the docs.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@gravitino.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

Re: [PR] [#5472] improvement(docs): Add example to use cloud storage fileset and polish hadoop-catalog document. [gravitino]

Reply via email to