This is an automated email from the ASF dual-hosted git repository.

jiayu pushed a commit to branch docs-mar-07
in repository https://gitbox.apache.org/repos/asf/sedona.git
commit 26e8f81b3fa069c150c7531e47310e0ac547cd85 Author: Jia Yu <[email protected]> AuthorDate: Sun Mar 9 23:52:49 2025 -0700 Refine STAC catalog reader --- .../files/stac-sedona-spark.md} | 148 +++++++------- docs/tutorial/sql.md | 225 +-------------------- mkdocs.yml | 2 +- 3 files changed, 78 insertions(+), 297 deletions(-) diff --git a/docs/api/sql/Stac.md b/docs/tutorial/files/stac-sedona-spark.md similarity index 86% rename from docs/api/sql/Stac.md rename to docs/tutorial/files/stac-sedona-spark.md index 8d56644e5e..09bcba2328 100644 --- a/docs/api/sql/Stac.md +++ b/docs/tutorial/files/stac-sedona-spark.md @@ -17,6 +17,8 @@ under the License. --> +# STAC catalog with Apache Sedona and Spark + The STAC data source allows you to read data from a SpatioTemporal Asset Catalog (STAC) API. The data source supports reading STAC items and collections. ## Usage @@ -108,29 +110,29 @@ root +------------+--------------------+-------+--------------------+--------------------+--------------------+-----+-----------+--------------------+--------------+------------+--------------------+--------------------+-----------+-----------+-------------+-------+----+--------------------+--------------------+--------------------+ ``` -# Filter Pushdown +## Filter Pushdown The STAC data source supports predicate pushdown for spatial and temporal filters. The data source can push down spatial and temporal filters to the underlying data source to reduce the amount of data that needs to be read. -## Spatial Filter Pushdown +### Spatial Filter Pushdown Spatial filter pushdown allows the data source to apply spatial predicates (e.g., st_contains, st_intersects) directly at the data source level, reducing the amount of data transferred and processed. -## Temporal Filter Pushdown +### Temporal Filter Pushdown Temporal filter pushdown allows the data source to apply temporal predicates (e.g., BETWEEN, >=, <=) directly at the data source level, similarly reducing the amount of data transferred and processed. -# Examples +## Examples Here are some examples demonstrating how to query a STAC data source that is loaded into a table named `STAC_TABLE`. -## SQL Select Without Filters +### SQL Select Without Filters ```sql SELECT id, datetime as dt, geometry, bbox FROM STAC_TABLE ``` -## SQL Select With Temporal Filter +### SQL Select With Temporal Filter ```sql SELECT id, datetime as dt, geometry, bbox @@ -140,7 +142,7 @@ SELECT id, datetime as dt, geometry, bbox FROM STAC_TABLE In this example, the data source will push down the temporal filter to the underlying data source. -## SQL Select With Spatial Filter +### SQL Select With Spatial Filter ```sql SELECT id, geometry @@ -150,7 +152,7 @@ In this example, the data source will push down the temporal filter to the under In this example, the data source will push down the spatial filter to the underlying data source. -## Sedona Configuration for STAC Reader +### Sedona Configuration for STAC Reader When using the STAC reader in Sedona, several configuration options can be set to control the behavior of the reader. These configurations are typically set in a `Map[String, String]` and passed to the reader. Below are the key sedona configuration options: @@ -192,73 +194,13 @@ These configurations can be combined into a single `Map[String, String]` and pas These options above provide fine-grained control over how the STAC data is read and processed in Sedona. 
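To make the configuration description above concrete, here is a minimal PySpark sketch of passing reader options as a plain dictionary (the Python analogue of a `Map[String, String]`). It is only a sketch: it assumes the data source is registered under the short name `stac`, that a collection URL such as the Planetary Computer `aster-l1t` collection is given as the load path (see the Usage section above), and the option key shown is a placeholder to be replaced with the configuration keys documented in this section.

```python
# Assumes `sedona` is the SedonaContext/SparkSession created as shown in the
# "Initiate SedonaContext" section of the Sedona docs.

stac_options = {
    # Placeholder key -- substitute the Sedona STAC configuration keys listed above.
    "your.stac.option.key": "value",
}

df = (
    sedona.read.format("stac")
    .options(**stac_options)
    .load("https://planetarycomputer.microsoft.com/api/stac/v1/collections/aster-l1t")
)
df.createOrReplaceTempView("STAC_TABLE")
df.printSchema()
```

Registering the result as `STAC_TABLE` lets the SQL filter-pushdown examples above run unchanged against the loaded collection.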
-# Python API +## Python API The Python API allows you to interact with a SpatioTemporal Asset Catalog (STAC) API using the Client class. This class provides methods to open a connection to a STAC API, retrieve collections, and search for items with various filters. -## Client Class - -## Methods - -### `open(url: str) -> Client` - -Opens a connection to the specified STAC API URL. - -**Parameters:** - -- `url` (*str*): The URL of the STAC API to connect to. - **Example:** `"https://planetarycomputer.microsoft.com/api/stac/v1"` - -**Returns:** - -- `Client`: An instance of the `Client` class connected to the specified URL. - ---- - -### `get_collection(collection_id: str) -> CollectionClient` - -Retrieves a collection client for the specified collection ID. - -**Parameters:** - -- `collection_id` (*str*): The ID of the collection to retrieve. - **Example:** `"aster-l1t"` - -**Returns:** +### Sample Code -- `CollectionClient`: An instance of the `CollectionClient` class for the specified collection. - ---- - -### `search(*ids: Union[str, list], collection_id: str, bbox: Optional[list] = None, datetime: Optional[Union[str, datetime.datetime, list]] = None, max_items: Optional[int] = None, return_dataframe: bool = True) -> Union[Iterator[PyStacItem], DataFrame]` - -Searches for items in the specified collection with optional filters. - -**Parameters:** - -- `ids` (*Union[str, list]*): A variable number of item IDs to filter the items. - **Example:** `"item_id1"` or `["item_id1", "item_id2"]` -- `collection_id` (*str*): The ID of the collection to search in. - **Example:** `"aster-l1t"` -- `bbox` (*Optional[list]*): A list of bounding boxes for filtering the items. Each bounding box is represented as a list of four float values: `[min_lon, min_lat, max_lon, max_lat]`. - **Example:** `[[ -180.0, -90.0, 180.0, 90.0 ]]` -- `datetime` (*Optional[Union[str, datetime.datetime, list]]*): A single datetime, RFC 3339-compliant timestamp, or a list of date-time ranges for filtering the items. - **Example:** - - `"2020-01-01T00:00:00Z"` - - `datetime.datetime(2020, 1, 1)` - - `[["2020-01-01T00:00:00Z", "2021-01-01T00:00:00Z"]]` -- `max_items` (*Optional[int]*): The maximum number of items to return from the search, even if there are more matching results. - **Example:** `100` -- `return_dataframe` (*bool*): If `True` (default), return the result as a Spark DataFrame instead of an iterator of `PyStacItem` objects. - **Example:** `True` - -**Returns:** - -- *Union[Iterator[PyStacItem], DataFrame]*: An iterator of `PyStacItem` objects or a Spark DataFrame that matches the specified filters. 
- -## Sample Code - -### Initialize the Client +#### Initialize the Client ```python from sedona.stac.client import Client @@ -267,7 +209,7 @@ from sedona.stac.client import Client client = Client.open("https://planetarycomputer.microsoft.com/api/stac/v1") ``` -### Search Items on a Collection Within a Year +#### Search Items on a Collection Within a Year ```python items = client.search( @@ -275,7 +217,7 @@ items = client.search( ) ``` -### Search Items on a Collection Within a Month and Max Items +#### Search Items on a Collection Within a Month and Max Items ```python items = client.search( @@ -283,7 +225,7 @@ items = client.search( ) ``` -### Search Items with Bounding Box and Interval +#### Search Items with Bounding Box and Interval ```python items = client.search( @@ -295,14 +237,14 @@ items = client.search( ) ``` -### Search Multiple Items with Multiple Bounding Boxes +#### Search Multiple Items with Multiple Bounding Boxes ```python bbox_list = [[-180.0, -90.0, 180.0, 90.0], [-100.0, -50.0, 100.0, 50.0]] items = client.search(collection_id="aster-l1t", bbox=bbox_list, return_dataframe=False) ``` -### Search Items and Get DataFrame as Return with Multiple Intervals +#### Search Items and Get DataFrame as Return with Multiple Intervals ```python interval_list = [ @@ -315,7 +257,7 @@ df = client.search( df.show() ``` -### Save Items in DataFrame to GeoParquet with Both Bounding Boxes and Intervals +#### Save Items in DataFrame to GeoParquet with Both Bounding Boxes and Intervals ```python # Save items in DataFrame to GeoParquet with both bounding boxes and intervals @@ -326,7 +268,57 @@ client.get_collection("aster-l1t").save_to_geoparquet( These examples demonstrate how to use the Client class to search for items in a STAC collection with various filters and return the results as either an iterator of PyStacItem objects or a Spark DataFrame. -# References +### Methods + +**`open(url: str) -> Client`** +Opens a connection to the specified STAC API URL. + +Parameters: +* `url` (*str*): The URL of the STAC API to connect to. + * Example: `"https://planetarycomputer.microsoft.com/api/stac/v1"` + +Returns: +* `Client`: An instance of the `Client` class connected to the specified URL. + +--- + +**`get_collection(collection_id: str) -> CollectionClient`** +Retrieves a collection client for the specified collection ID. + +Parameters: +* `collection_id` (*str*): The ID of the collection to retrieve. + * Example: `"aster-l1t"` + +Returns: +* `CollectionClient`: An instance of the `CollectionClient` class for the specified collection. + +--- + +**`search(*ids: Union[str, list], collection_id: str, bbox: Optional[list] = None, datetime: Optional[Union[str, datetime.datetime, list]] = None, max_items: Optional[int] = None, return_dataframe: bool = True) -> Union[Iterator[PyStacItem], DataFrame]`** +Searches for items in the specified collection with optional filters. + +Parameters: + +* `ids` (*Union[str, list]*): A variable number of item IDs to filter the items. + * Example: `"item_id1"` or `["item_id1", "item_id2"]` +* `collection_id` (*str*): The ID of the collection to search in. + * Example: `"aster-l1t"` +* `bbox` (*Optional[list]*): A list of bounding boxes for filtering the items. Each bounding box is represented as a list of four float values: `[min_lon, min_lat, max_lon, max_lat]`. 
+ * Example: `[[ -180.0, -90.0, 180.0, 90.0 ]]` +* `datetime` (*Optional[Union[str, datetime.datetime, list]]*): A single datetime, RFC 3339-compliant timestamp, or a list of date-time ranges for filtering the items. + * Examples: + * `"2020-01-01T00:00:00Z"` + * `datetime.datetime(2020, 1, 1)` + * `[["2020-01-01T00:00:00Z", "2021-01-01T00:00:00Z"]]` +* `max_items` (*Optional[int]*): The maximum number of items to return from the search, even if there are more matching results. + * Example: `100` +* `return_dataframe` (*bool*): If `True` (default), return the result as a Spark DataFrame instead of an iterator of `PyStacItem` objects. + * Example: `True` + +Returns: +* *Union[Iterator[PyStacItem], DataFrame]*: An iterator of `PyStacItem` objects or a Spark DataFrame that matches the specified filters. + +## References - STAC Specification: https://stacspec.org/ diff --git a/docs/tutorial/sql.md b/docs/tutorial/sql.md index 4ea1ff0754..fecbf3ef82 100644 --- a/docs/tutorial/sql.md +++ b/docs/tutorial/sql.md @@ -20,7 +20,7 @@ The page outlines the steps to manage spatial data using SedonaSQL. !!!note - Since v`1.5.0`, Sedona assumes geographic coordinates to be in longitude/latitude order. If your data is lat/lon order, please use `ST_FlipCoordinates` to swap X and Y. + Sedona assumes geographic coordinates to be in longitude/latitude order. If your data is lat/lon order, please use `ST_FlipCoordinates` to swap X and Y. SedonaSQL supports SQL/MM Part3 Spatial SQL Standard. It includes four kinds of SQL operators as follows. All these operators can be directly called through: @@ -64,8 +64,6 @@ Detailed SedonaSQL APIs are available here: [SedonaSQL API](../api/sql/Overview. Use the following code to create your Sedona config at the beginning. If you already have a SparkSession (usually named `spark`) created by AWS EMR/Databricks/Microsoft Fabric, please ==skip this step==. -==Sedona >= 1.4.1== - You can add additional Spark runtime config to the config builder. For example, `SedonaContext.builder().config("spark.sql.autoBroadcastJoinThreshold", "10485760")` === "Scala" @@ -111,65 +109,10 @@ You can add additional Spark runtime config to the config builder. For example, ``` If you are using a different Spark version, please replace the `3.3` in package name of sedona-spark-shaded with the corresponding major.minor version of Spark, such as `sedona-spark-shaded-3.4_2.12:{{ sedona.current_version }}`. -==Sedona < 1.4.1== - -The following method has been deprecated since Sedona 1.4.1. Please use the method above to create your Sedona config. 
- -=== "Scala" - - ```scala - var sparkSession = SparkSession.builder() - .master("local[*]") // Delete this if run in cluster mode - .appName("readTestScala") // Change this to a proper name - // Enable Sedona custom Kryo serializer - .config("spark.serializer", classOf[KryoSerializer].getName) // org.apache.spark.serializer.KryoSerializer - .config("spark.kryo.registrator", classOf[SedonaKryoRegistrator].getName) - .getOrCreate() // org.apache.sedona.core.serde.SedonaKryoRegistrator - ``` - If you use SedonaViz together with SedonaSQL, please use the following two lines to enable Sedona Kryo serializer instead: - ```scala - .config("spark.serializer", classOf[KryoSerializer].getName) // org.apache.spark.serializer.KryoSerializer - .config("spark.kryo.registrator", classOf[SedonaVizKryoRegistrator].getName) // org.apache.sedona.viz.core.Serde.SedonaVizKryoRegistrator - ``` - -=== "Java" - - ```java - SparkSession sparkSession = SparkSession.builder() - .master("local[*]") // Delete this if run in cluster mode - .appName("readTestJava") // Change this to a proper name - // Enable Sedona custom Kryo serializer - .config("spark.serializer", KryoSerializer.class.getName()) // org.apache.spark.serializer.KryoSerializer - .config("spark.kryo.registrator", SedonaKryoRegistrator.class.getName()) - .getOrCreate() // org.apache.sedona.core.serde.SedonaKryoRegistrator - ``` - If you use SedonaViz together with SedonaSQL, please use the following two lines to enable Sedona Kryo serializer instead: - ```java - .config("spark.serializer", KryoSerializer.class.getName()) // org.apache.spark.serializer.KryoSerializer - .config("spark.kryo.registrator", SedonaVizKryoRegistrator.class.getName()) // org.apache.sedona.viz.core.Serde.SedonaVizKryoRegistrator - ``` - -=== "Python" - - ```python - sparkSession = SparkSession. \ - builder. \ - appName('readTestPython'). \ - config("spark.serializer", KryoSerializer.getName()). \ - config("spark.kryo.registrator", SedonaKryoRegistrator.getName()). \ - config('spark.jars.packages', - 'org.apache.sedona:sedona-spark-shaded-3.3_2.12:{{ sedona.current_version }},' - 'org.datasyslab:geotools-wrapper:{{ sedona.current_geotools }}'). \ - getOrCreate() - ``` - If you are using Spark versions >= 3.4, please replace the `3.0` in package name of sedona-spark-shaded with the corresponding major.minor version of Spark, such as `sedona-spark-shaded-3.4_2.12:{{ sedona.current_version }}`. - ## Initiate SedonaContext Add the following line after creating Sedona config. If you already have a SparkSession (usually named `spark`) created by AWS EMR/Databricks/Microsoft Fabric, please call `sedona = SedonaContext.create(spark)` instead. For ==Databricks==, the situation is more complicated, please refer to [Databricks setup guide](../setup/databricks.md), but generally you don't need to create SedonaContext. -==Sedona >= 1.4.1== - === "Scala" ```scala @@ -194,30 +137,6 @@ Add the following line after creating Sedona config. If you already have a Spark sedona = SedonaContext.create(config) ``` -==Sedona < 1.4.1== - -The following method has been deprecated since Sedona 1.4.1. Please use the method above to create your SedonaContext. 
- -=== "Scala" - - ```scala - SedonaSQLRegistrator.registerAll(sparkSession) - ``` - -=== "Java" - - ```java - SedonaSQLRegistrator.registerAll(sparkSession) - ``` - -=== "Python" - - ```python - from sedona.register import SedonaRegistrator - - SedonaRegistrator.registerAll(spark) - ``` - You can also register everything by passing `--conf spark.sql.extensions=org.apache.sedona.sql.SedonaSqlExtensions` to `spark-submit` or `spark-shell`. ## Load data from files @@ -315,56 +234,6 @@ root Since `v1.6.1`, Sedona supports reading GeoJSON files using the `geojson` data source. It is designed to handle JSON files that use [GeoJSON format](https://datatracker.ietf.org/doc/html/rfc7946) for their geometries. -This includes SpatioTemporal Asset Catalog (STAC) files, GeoJSON features, GeoJSON feature collections and other variations. -The key functionality lies in the way 'geometry' fields are processed: these are specifically read as Sedona's `GeometryUDT` type, ensuring integration with Sedona's suite of spatial functions. - -### Key features - -- Broad Support: The reader and writer are versatile, supporting all GeoJSON-formatted files, including STAC files, feature collections, and more. -- Geometry Transformation: When reading, fields named 'geometry' are automatically converted from GeoJSON format to Sedona's `GeometryUDT` type and vice versa when writing. - -### Load MultiLine GeoJSON FeatureCollection - -Suppose we have a GeoJSON FeatureCollection file as follows. -This entire file is considered as a single GeoJSON FeatureCollection object. -Multiline format is preferable for scenarios where files need to be human-readable or manually edited. - -```json -{ "type": "FeatureCollection", - "features": [ - { "type": "Feature", - "geometry": {"type": "Point", "coordinates": [102.0, 0.5]}, - "properties": {"prop0": "value0"} - }, - { "type": "Feature", - "geometry": { - "type": "LineString", - "coordinates": [ - [102.0, 0.0], [103.0, 1.0], [104.0, 0.0], [105.0, 1.0] - ] - }, - "properties": { - "prop0": "value1", - "prop1": 0.0 - } - }, - { "type": "Feature", - "geometry": { - "type": "Polygon", - "coordinates": [ - [ [100.0, 0.0], [101.0, 0.0], [101.0, 1.0], - [100.0, 1.0], [100.0, 0.0] ] - ] - }, - "properties": { - "prop0": "value2", - "prop1": {"this": "that"} - } - } - ] -} -``` - Set the `multiLine` option to `True` to read multiline GeoJSON files. === "Python" @@ -402,81 +271,7 @@ Set the `multiLine` option to `True` to read multiline GeoJSON files. df.printSchema(); ``` -The output is as follows: - -``` -+--------------------+------+ -| geometry| prop0| -+--------------------+------+ -| POINT (102 0.5)|value0| -|LINESTRING (102 0...|value1| -|POLYGON ((100 0, ...|value2| -+--------------------+------+ - -root - |-- geometry: geometry (nullable = false) - |-- prop0: string (nullable = true) - -``` - -### Load Single Line GeoJSON Features - -Suppose we have a single-line GeoJSON Features dataset as follows. Each line is a single GeoJSON Feature. -This format is efficient for processing large datasets where each line is a separate, self-contained GeoJSON object. 
- -```json -{"type":"Feature","geometry":{"type":"Point","coordinates":[102.0,0.5]},"properties":{"prop0":"value0"}} -{"type":"Feature","geometry":{"type":"LineString","coordinates":[[102.0,0.0],[103.0,1.0],[104.0,0.0],[105.0,1.0]]},"properties":{"prop0":"value1"}} -{"type":"Feature","geometry":{"type":"Polygon","coordinates":[[[100.0,0.0],[101.0,0.0],[101.0,1.0],[100.0,1.0],[100.0,0.0]]]},"properties":{"prop0":"value2"}} -``` - -By default, when `option` is not specified, Sedona reads a GeoJSON file as a single line GeoJSON. - -=== "Python" - - ```python - df = sedona.read.format("geojson").load("PATH/TO/MYFILE.json") - .withColumn("prop0", f.expr("properties['prop0']")).drop("properties").drop("type") - - df.show() - df.printSchema() - ``` - -=== "Scala" - - ```scala - val df = sedona.read.format("geojson").load("PATH/TO/MYFILE.json") - .withColumn("prop0", expr("properties['prop0']")).drop("properties").drop("type") - - df.show() - df.printSchema() - ``` - -=== "Java" - - ```java - Dataset<Row> df = sedona.read.format("geojson").load("PATH/TO/MYFILE.json") - .withColumn("prop0", expr("properties['prop0']")).drop("properties").drop("type") - - df.show() - df.printSchema() - ``` - -The output is as follows: - -``` -+--------------------+------+ -| geometry| prop0| -+--------------------+------+ -| POINT (102 0.5)|value0| -|LINESTRING (102 0...|value1| -|POLYGON ((100 0, ...|value2| -+--------------------+------+ - -root - |-- geometry: geometry (nullable = false) - |-- prop0: string (nullable = true) -``` +See [this page](files/geojson-sedona-spark.md) for more information on loading GeoJSON files. ## Load Shapefile @@ -502,7 +297,7 @@ Since v`1.7.0`, Sedona supports loading Shapefile as a DataFrame. The input path can be a directory containing one or multiple shapefiles, or path to a `.shp` file. -See [this page](../files/shapefile-sedona-spark) for more information on loading Shapefiles. +See [this page](files/shapefile-sedona-spark.md) for more information on loading Shapefiles. ## Load GeoParquet @@ -550,7 +345,7 @@ Please refer to [Reading Legacy Parquet Files](../api/sql/Reading-legacy-parquet GeoParquet file reader does not work on Databricks runtime when Photon is enabled. Please disable Photon when using GeoParquet file reader on Databricks runtime. -See [this page](../files/geoparquet-sedona-spark) for more information on loading GeoParquet. +See [this page](files/geoparquet-sedona-spark.md) for more information on loading GeoParquet. ## Load data from JDBC data sources @@ -634,7 +429,7 @@ Since v1.7.0, Sedona supports loading Geopackage file format as a DataFrame. df = sedona.read.format("geopackage").option("tableName", "tab").load("/path/to/geopackage") ``` -See [this page](../files/geopackage-sedona-spark) for more information on loading GeoPackage. +See [this page](files/geopackage-sedona-spark.md) for more information on loading GeoPackage. 
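Whichever reader above produced your DataFrame (GeoParquet, Shapefile, GeoPackage, ...), the geometry column comes back as Sedona's geometry type, so the frame can be registered as a temporary view and queried with SedonaSQL. A minimal sketch, assuming the DataFrame is named `df`, its geometry column is named `geometry`, and the data is in EPSG:4326 (adjust all three to your data):

```python
# Register the loaded DataFrame so it can be queried with SedonaSQL.
df.createOrReplaceTempView("spatial_table")

# Example query: compute each geometry's envelope and reproject it to Web Mercator,
# assuming the source data is in EPSG:4326.
result = sedona.sql("""
    SELECT ST_Envelope(geometry) AS bbox,
           ST_Transform(geometry, 'EPSG:4326', 'EPSG:3857') AS geom_3857
    FROM spatial_table
""")
result.show(5)
```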
## Load from OSM PBF @@ -1401,13 +1196,7 @@ Since `v1.6.1`, the GeoJSON data source in Sedona can be used to save a Spatial df.write.format("geojson").save("YOUR/PATH.json") ``` -The structure of the generated file will be like this: - -```json -{"type":"Feature","geometry":{"type":"Point","coordinates":[102.0,0.5]},"properties":{"prop0":"value0"}} -{"type":"Feature","geometry":{"type":"LineString","coordinates":[[102.0,0.0],[103.0,1.0],[104.0,0.0],[105.0,1.0]]},"properties":{"prop0":"value1"}} -{"type":"Feature","geometry":{"type":"Polygon","coordinates":[[[100.0,0.0],[101.0,0.0],[101.0,1.0],[100.0,1.0],[100.0,0.0]]]},"properties":{"prop0":"value2"}} -``` +See [this page](files/geojson-sedona-spark.md) for more information on writing to GeoJSON. ## Save GeoParquet @@ -1417,7 +1206,7 @@ Since v`1.3.0`, Sedona natively supports writing GeoParquet file. GeoParquet can df.write.format("geoparquet").save(geoparquetoutputlocation + "/GeoParquet_File_Name.parquet") ``` -See [this page](../files/geoparquet-sedona-spark) for more information on writing to GeoParquet. +See [this page](files/geoparquet-sedona-spark.md) for more information on writing to GeoParquet. ## Save to Postgis diff --git a/mkdocs.yml b/mkdocs.yml index 2db776b633..60618b5a42 100644 --- a/mkdocs.yml +++ b/mkdocs.yml @@ -66,6 +66,7 @@ nav: - GeoParquet: tutorial/files/geoparquet-sedona-spark.md - GeoJSON: tutorial/files/geojson-sedona-spark.md - Shapefiles: tutorial/files/shapefiles-sedona-spark.md + - STAC catalog: tutorial/files/stac-sedona-spark.md - Concepts: - Spatial Joins: tutorial/concepts/spatial-joins.md - Clustering Algorithms: tutorial/concepts/clustering-algorithms.md @@ -97,7 +98,6 @@ nav: - Query optimization: api/sql/Optimizer.md - Nearest-Neighbour searching: api/sql/NearestNeighbourSearching.md - "Spider:Spatial Data Generator": api/sql/Spider.md - - Reading STAC Data Source: api/sql/Stac.md - Reading Legacy Parquet Files: api/sql/Reading-legacy-parquet.md - Visualization: - SedonaPyDeck: api/sql/Visualization_SedonaPyDeck.md