Copilot commented on code in PR #2418: URL: https://github.com/apache/sedona/pull/2418#discussion_r2449947458
########## docs/blog/posts/spatial-tables-lakehouse.md: ##########
@@ -0,0 +1,528 @@
+---
+date:
+  created: 2025-10-21
+links:
+  - SedonaDB: https://sedona.apache.org/sedonadb/
+  - SpatialBench: https://sedona.apache.org/spatialbench/
+authors:
+  - matt_powers
+title: "Managing spatial tables in Data Lakehouses with Iceberg"
+---
+
+<!--
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements. See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership. The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License. You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied. See the License for the
+# specific language governing permissions and limitations
+# under the License.
+-->
+
+This post explains the benefits of the Lakehouse Architecture for spatial tables and how Lakehouses differ from data warehouses and data lakes.
+
+<!-- more -->
+
+Many of the benefits that Lakehouses (e.g., Iceberg, Delta Lake, Hudi) provide for tabular data also apply to spatial data, for example:
+
+* Reliable transactions
+* Versioned data
+* Time travel
+* Schema enforcement
+* Optimizations
+* And many more
+
+The spatial data community can use Lakehouses as of Iceberg v3, which adds geometry/geography types.
+
+Spatial data requires different types of metadata and optimizations, but doesn’t always require entirely different file formats. Iceberg can now store the metadata needed for geometry and geography columns, so you can use Iceberg for both tabular and spatial data.
+
+This post also explains why the Lakehouse architecture is often better than data warehouses and data lakes. Let’s start with a detailed description of the Lakehouse architecture.
+
+## Data Lakehouse Architecture Overview
+
+A Lakehouse stores data in a Lakehouse Storage System like Iceberg, Delta Lake, or Hudi. Iceberg v3 is currently the best option for spatial data because it natively supports geometry and geography columns.
+
+Tables in Lakehouses are governed by a catalog like Unity Catalog or Apache Polaris. The catalog allows role-based access control (RBAC) and features like multi-table transactions.
+
+You can query tables in the Lakehouse Architecture for BI, reporting, data science, machine learning, and other complex analyses.
+
+The following diagram illustrates the Lakehouse Architecture:
+
+{ align=center width="80%" }
+
+The Lakehouse Architecture offers several advantages:
+
+* Data is stored in open formats so any engine can query it, and there is no vendor lock-in.
+* Lakehouses support all the features familiar to data warehouses, like reliable transactions, DML operations, and role-based access control.
+* Lakehouses are often performant enough for low-latency applications like BI dashboards.
+* Lakehouses are interoperable with proprietary tools like BigQuery, Redshift, or Esri.
+* You can store Lakehouses in cloud-based storage systems without any charges beyond the storage itself.
+* Lakehouses are compatible with any engine. You can use one engine for ingestion, another for ETL, and a third for ML. The architecture encourages using the best engine for the job.
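The versioned-data and time-travel benefits in the list above are easy to demonstrate. Here is a hedged sketch (not part of the PR diff), assuming a Spark session with the Iceberg catalog `local` used later in the post; the timestamp is hypothetical:

```sql
-- List the table's snapshots via Iceberg's metadata table.
SELECT committed_at, snapshot_id, operation
FROM local.db.customers.snapshots;

-- Read the table as of an earlier point in time.
SELECT * FROM local.db.customers TIMESTAMP AS OF '2025-10-21 00:00:00';
```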
+
+Let’s see how Lakehouses differ from data lakes.
+
+## Spatial tables in Lakehouses vs. Data Lakes
+
+Data Lakes store data in files without a metadata layer, so they don’t guarantee reliable transactions.
+
+Here are examples of spatial data lakes:
+
+* GeoParquet files stored in AWS S3
+* GeoJSON files stored in Azure Blob Storage
+* CSV files with WKT geometry data stored in GCP
+
+Since data lakes do not support reliable transactions, they cannot offer developer-friendly features like delete and merge, they require downtime when datasets are mutated, and they lack advanced performance features like deletion vectors and small file compaction.
+
+The Lakehouse metadata layer enables these convenience features and much better performance.
+
+The Lakehouse metadata layer is relatively small, so the storage costs for a Lakehouse and a data lake are about the same. Lakehouses allow for better performance, so compute expenses are generally lower than those of a data lake.
+
+## Spatial tables in Lakehouses vs. Data Warehouses
+
+A Data Warehouse is an analytics system powered by a proprietary engine and proprietary file formats. However, due to market pressures, this definition has shifted, and some Data Warehouses began supporting Lakehouse Storage Systems in addition to proprietary file formats. Many modern customers don’t want vendor lock-in via a proprietary file format.
+
+Data Warehouses have the following limitations:
+
+* They usually bundle storage with compute in pricing packages, so you must pay for more compute even if you just want more storage.
+* They store data in proprietary file formats, which aren’t compatible with other engines.
+* Queries on open file formats can run slower than queries on the warehouse’s native format.
+* Sharing compute with other users can cause performance degradation when one user runs a large query.
+
+Because engines that once supported only proprietary file formats now also support open formats, you can think of a Data Warehouse as a system built on either a proprietary engine or a proprietary file format.
+
+Many modern enterprises prefer the Lakehouse architecture because it’s open, compatible with any engine that builds a connector, vendor-neutral, and low-cost.
+
+Let’s now see how to create some Iceberg tables.
+
+## Creating tabular tables with Iceberg
+
+Here’s how to create a `customers` table with `id` and `first_name` columns:
+
+```sql
+CREATE TABLE local.db.customers (id string, first_name string)
+USING iceberg
+TBLPROPERTIES('format-version'='3');
+```
+
+Let’s append some data to the table:
+
+```python
+df = sedona.createDataFrame(
+    [
+        ("a", "Bob"),
+        ("b", "Mary"),
+        ("c", "Sue"),
+    ],
+    ["id", "first_name"],
+)
+
+df.write.format("iceberg").mode("append").saveAsTable("local.db.customers")
+```
+
+Let’s see how to run a query on the table:
+
+```
+sedona.table("local.db.customers").show()
+
++---+----------+
+| id|first_name|
++---+----------+
+|  a|       Bob|
+|  b|      Mary|
+|  c|       Sue|
++---+----------+
+```
+
+Creating a table with tabular data is straightforward. Now, let’s see how to make a table with spatial data in Iceberg.
+
+## Creating spatial tables with Iceberg v3
+
+Let’s now create a `customer_purchases` table with a `geometry` column that stores each purchase location.
+
+Here’s how to create this spatial table with Iceberg:
+
+```sql
+CREATE TABLE local.db.customer_purchases (id string, price double, geometry geometry)
+USING iceberg
+TBLPROPERTIES('format-version'='3');
+```
+
+Now append some spatial data with longitude/latitude coordinates to the table:
+
+```python
+from shapely.geometry import Point, Polygon
+
+coords = [
+    (-88.110352, 24.006326),
+    (-77.080078, 24.006326),
+    (-77.080078, 31.503629),
+    (-88.110352, 31.503629),
+    (-88.110352, 24.006326),
+]
+df = sedona.createDataFrame(
+    [
+        ("a", 10.99, Polygon(coords)),
+        ("b", 3.5, Point(1, 2)),
+        ("c", 1.95, Point(3, 4)),
+    ],
+    ["id", "price", "geometry"],
+)
+
+df.write.format("iceberg").mode("append").saveAsTable("local.db.customer_purchases")
+```
+
+The spatial table contains points and polygons. Some purchases have an exact location, and others occur in a region.
+
+Let’s see how to join the tabular table with the spatial table.
+
+## Joining an Iceberg tabular table with a spatial table
+
+Here’s how to join the `customers` and `customer_purchases` tables:
+
+```
+customers = sedona.table("local.db.customers")
+purchases = sedona.table("local.db.customer_purchases")
+
+joined = customers.join(purchases, "id")
+joined.show()
+
++---+----------+-----+--------------------+
+| id|first_name|price|            geometry|
++---+----------+-----+--------------------+
+|  a|       Bob|10.99|POLYGON ((-88.110...|
+|  b|      Mary|  3.5|         POINT (1 2)|
+|  c|       Sue| 1.95|         POINT (3 4)|
++---+----------+-----+--------------------+
+```
+
+Now we can see the customer information and the location of their purchases all in one table.
+
+Keep reading for an example with a spatial join, which is a join based on the geometry columns of two tables.
+
+It’s easy to join any tables with Sedona, regardless of the underlying file format, because Sedona has many built-in file readers (e.g., you can easily join one table stored in Shapefiles and another stored in GeoParquet files). But it’s even easier when Iceberg stores the tabular and spatial tables in the same catalog.
+
+## Optimizing spatial tables in Lakehouses
+
+This section shows how queries can run faster on Iceberg tables.
+
+Let’s query the Overture Maps Foundation buildings dataset stored in GeoParquet files:
+
+```python
+from pyspark.sql.functions import col
+from sedona.sql.st_functions import ST_GeomFromWKB
+
+(
+    sedona.table("open_data.overture_2025_03_19_1.buildings_building")
+    .withColumn("geometry", ST_GeomFromWKB(col("geometry")))
+    .select("id", "geometry", "num_floors", "roof_color")
+    .createOrReplaceTempView("my_fun_view")
+)
+```
+
+Using this GeoParquet dataset, let’s run a filtering query to get the number of buildings in a small region near Gainesville, Florida:
+
+```python
+spot = "POLYGON((-82.258759 29.129371, -82.180481 29.136569, -82.202454 29.173747, -82.258759 29.129371))"
+sql = f"""
+select * from my_fun_view
+where ST_Contains(ST_GeomFromWKT('{spot}'), geometry)
+"""
+sedona.sql(sql).count()
+```
+
+This query runs in 45 seconds.
+
+Let’s convert this dataset to Iceberg:
+
+```python
+df = sedona.table("open_data.overture_2025_03_19_1.buildings_building")
+
+sql = """
+CREATE TABLE local.db.overture_2025_03_19_1_buildings_building (id string, geometry geometry, num_floors integer, roof_color string)
+USING iceberg
+TBLPROPERTIES('format-version'='3');
+"""
+sedona.sql(sql)
+
+(
+    df.select("id", "geometry", "num_floors", "roof_color")
+    .write.format("iceberg")
+    .mode("overwrite")
+    .saveAsTable("local.db.overture_2025_03_19_1_buildings_building")
+)
+```
+
+Now let's rerun the same query on the Iceberg table:
+
+```python
+spot = "POLYGON((-82.258759 29.129371, -82.180481 29.136569, -82.202454 29.173747, -82.258759 29.129371))"
+sql = f"""
+select * from local.db.overture_2025_03_19_1_buildings_building
+where ST_Contains(ST_GeomFromWKT('{spot}'), geometry)
+"""
+sedona.sql(sql).count()
+```
+
+This query executes in 4 seconds.
+
+To make this Iceberg query run even faster, we could apply further optimizations that colocate spatially nearby data in the same files (see the sketch below).
+
+## More geospatial examples in Iceberg
+
+This section uses a contrived example to demonstrate some Iceberg features that help you manage your spatial data. Start by creating two tables: one for the blue geometric objects and another for the orange polygons in the following graph:
+
+{ align=center width="80%" }
+
+Start with the Iceberg table for the geometric objects:
+
+```sql
+CREATE TABLE some_catalog.matt.icegeometries (id string, geometry geometry)
+USING iceberg
+TBLPROPERTIES('format-version'='3');
+```
+
+Append objects `a`, `b, c, d, `and` e` to the table:

Review Comment:
   Corrected spacing and formatting in the list of objects from '`b, c, d, `and` e`' to '`b`, `c`, `d`, and `e`'.

```suggestion
Append objects `a`, `b`, `c`, `d`, and `e` to the table:
```
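To make the colocation idea mentioned above concrete, here is a hedged sketch (not part of the PR diff): sort rows by a geohash of the geometry before writing, so spatially nearby features land in the same data files. It assumes Sedona exposes `ST_GeoHash` in `sedona.sql.st_functions`, and the `..._sorted` table name is hypothetical:

```python
from pyspark.sql.functions import col
from sedona.sql.st_functions import ST_GeoHash

buildings = sedona.table("local.db.overture_2025_03_19_1_buildings_building")

# Sort by a geohash derived from each geometry so that spatially
# nearby rows are written into the same Iceberg data files.
(
    buildings.withColumn("geohash", ST_GeoHash(col("geometry"), 8))
    .sort("geohash")
    .drop("geohash")
    .write.format("iceberg")
    .mode("overwrite")
    .saveAsTable("local.db.overture_2025_03_19_1_buildings_building_sorted")
)
```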
########## docs/blog/posts/spatial-tables-lakehouse.md: ##########
@@ -0,0 +1,528 @@
+
+```python
+from pyspark.sql.functions import col
+from sedona.sql.st_functions import ST_GeomFromText
+
+df = sedona.createDataFrame(
+    [
+        ("a", "LINESTRING(1.0 3.0,3.0 1.0)"),
+        ("b", "LINESTRING(2.0 5.0,6.0 1.0)"),
+        ("c", "POLYGON((7.0 4.0,9.0 4.0,9.0 3.0,7.0 3.0,7.0 4.0))"),
+        ("d", "LINESTRING(2.0 7.0,4.0 9.0,7.0 8.0)"),
+        ("e", "LINESTRING(10.0 9.0,10.0 6.0)"),
+    ],
+    ["id", "geometry"],
+)
+df = df.withColumn("geometry", ST_GeomFromText(col("geometry")))
+
+df.write.format("iceberg").mode("append").saveAsTable("some_catalog.matt.icegeometries")
+```
+
+Check the content of the table:
+
+```
+sedona.sql("SELECT * FROM some_catalog.matt.icegeometries;").show(truncate=False)
+
++---+-----------------------------------+
+|id |geometry                           |
++---+-----------------------------------+
+|a  |LINESTRING (1 3, 3 1)              |
+|b  |LINESTRING (2 5, 6 1)              |
+|c  |POLYGON ((7 4, 9 4, 9 3, 7 3, 7 4))|
+|d  |LINESTRING (2 7, 4 9, 7 8)         |
+|e  |LINESTRING (10 9, 10 6)            |
++---+-----------------------------------+
+```
+
+Iceberg makes it easy to delete rows of data in a table based on a predicate, as you’ll see below.
+
+Now create a table with the polygons. Start by creating the Iceberg table:
+
+```sql
+CREATE TABLE some_catalog.matt.icepolygons (id string, geometry geometry)
+USING iceberg
+TBLPROPERTIES('format-version'='3');
+```
+
+Append objects `polygon_x `and `polygon_y` to the table:

Review Comment:
   Removed extra space after 'polygon_x' - should be '`polygon_x` and `polygon_y`'.

```suggestion
Append objects `polygon_x` and `polygon_y` to the table:
```
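The predicate deletes just mentioned have a companion: on Iceberg tables, merge is also plain SQL. A hedged upsert sketch (not part of the PR diff), assuming a temp view `purchase_updates` with the same schema as the `customer_purchases` table from earlier:

```sql
MERGE INTO local.db.customer_purchases t
USING purchase_updates u
ON t.id = u.id
WHEN MATCHED THEN UPDATE SET t.price = u.price
WHEN NOT MATCHED THEN INSERT *;
```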
########## docs/blog/posts/spatial-tables-lakehouse.md: ##########
@@ -0,0 +1,528 @@
+
+```python
+df = sedona.createDataFrame(
+    [
+        ("polygon_x", "POLYGON((3.0 5.0,8.0 5.0,8.0 2.0,3.0 2.0,3.0 5.0))"),
+        ("polygon_y", "POLYGON((5.0 9.0,8.0 9.0,8.0 7.0,5.0 7.0,5.0 9.0))"),
+    ],
+    ["id", "geometry"],
+)
+df = df.withColumn("geometry", ST_GeomFromText(col("geometry")))
+
+df.write.format("iceberg").mode("append").saveAsTable("some_catalog.matt.icegeometries")

Review Comment:
   The code is appending polygon data to the 'icegeometries' table instead of the 'icepolygons' table. This should use 'some_catalog.matt.icepolygons' as the target table.

```suggestion
df.write.format("iceberg").mode("append").saveAsTable("some_catalog.matt.icepolygons")
```
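The spatial join promised earlier in the post can be sketched against these two tables. A hedged example (not part of the PR diff), assuming Sedona exposes `ST_Intersects` in `sedona.sql.st_predicates` and that the polygons were written to `icepolygons` per the review comment above:

```python
from pyspark.sql.functions import col
from sedona.sql.st_predicates import ST_Intersects

geoms = sedona.table("some_catalog.matt.icegeometries").alias("g")
polys = sedona.table("some_catalog.matt.icepolygons").alias("p")

# Join on the geometry columns: keep each (geometry, polygon) pair
# whose shapes intersect. Sedona plans this as a spatial join.
joined = geoms.join(polys, ST_Intersects(col("g.geometry"), col("p.geometry")))
joined.select(
    col("g.id").alias("geometry_id"),
    col("p.id").alias("polygon_id"),
).show()
```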
########## docs/blog/posts/spatial-tables-lakehouse.md: ##########
@@ -0,0 +1,528 @@
+
+Here’s how you can delete all the geometries that intersect a polygon:
+
+```python
+# Assumes `polygon` holds the WKT of the query polygon.
+sql = f"""
+DELETE FROM some_catalog.matt.icegeometries
+WHERE ST_Intersects(geometry, ST_GeomFromWKT('{polygon}'))
+"""
+sedona.sql(sql)
+```
+
+Check the table to see that geometries `b`, `c,` and `d` are deleted from the table.

Review Comment:
   Inconsistent spacing in backticks - '`c,`' should be '`c`,' with the comma outside the backticks.

```suggestion
Check the table to see that geometries `b`, `c`, and `d` are deleted from the table.
```

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
