Re: [PR] [DOCS] fixing read parquet files page [sedona-db]

via GitHub Thu, 18 Sep 2025 19:25:06 -0700


paleolimbot commented on code in PR #110:
URL: https://github.com/apache/sedona-db/pull/110#discussion_r2361479840



##########
docs/working-with-parquet-files.md:
##########
@@ -0,0 +1,80 @@
+# Working with Parquet Files
+
+To read a GeoPaquet or Parquet file, you must use the dedicated 
`sd.read_parquet()` method. You cannot query a file path directly within the 
`sd.sql()` `FROM` clause.
+
+The `sd.sql()` function is designed to query tables that have already been 
registered in the session.

Review Comment:
   Putting them in the FROM clause should work! How about:
   
   ```
   The easiest way to read a GeoParquet or Parquet file is to use `sd.read_parquet()`; however,
   you can also query GeoParquet or Parquet files by referring to their path from SQL.
   ```



##########
docs/working-with-parquet-files.md:
##########
@@ -0,0 +1,80 @@
+# Working with Parquet Files
+
+To read a GeoPaquet or Parquet file, you must use the dedicated 
`sd.read_parquet()` method. You cannot query a file path directly within the 
`sd.sql()` `FROM` clause.
+
+The `sd.sql()` function is designed to query tables that have already been 
registered in the session.
+
+## Install SedonaDB
+
+Use pip to install SedonaDB from the Python Package Index (PyPI).
+
+
+```python
+%pip install "apache-sedona[db]"
+```
+
+## Implementation
+
+To read a geoparquet or parquet file with SedonaDB, you must:
+
+1. **Load** the Parquet file into a data frame using `sd.read_parquet()`.
+2. **Register** the data frame as a view with `to_view()`.
+3. **Query** the view using `sd.sql()`.
+4. **Write** to a Parquet file with `sd.to_parquet()`.
+
+
+```python
+# Import the sedona.db module and connect to SedonaDB
+import sedona.db
+sd = sedona.db.connect()
+```
+
+
+```python
+
+# 1. Load the Parquet file
+df = sd.read_parquet(
+    'https://raw.githubusercontent.com/geoarrow/geoarrow-data/v0.2.0/'
+    'natural-earth/files/natural-earth_cities_geo.parquet'
+)
+
+# 2. Register the data frame as a view
+df.to_view("zone")
+
+# 3. Query the view and store the result in a new DataFrame
+query_result_df = sd.sql("SELECT * FROM zone LIMIT 10")
+query_result_df.show()
+```
+
+
+```python
+
+# 4. Write the result to a new Parquet file
+output_path = "query_results.parquet"
+query_result_df.to_parquet(output_path)
+
+# (Optional) Verify the written file
+print(f"\nVerifying the written file at '{output_path}'...")
+verified_df = sd.read_parquet(output_path)
+verified_df.show(5)

Review Comment:
   I'm not sure these lines add anything to the tutorial.



##########
requirements-docs.txt:
##########


Review Comment:
   I believe this will break CI (and the code you've documented above).



##########
docs/index.md:
##########
@@ -24,30 +23,45 @@ title: Introducing SedonaDB
   under the License.
 -->
 
-SedonaDB is a high-performance, dependency-free geospatial compute engine 
designed for single-node processing, making it ideal for smaller datasets on 
local machines or cloud instances.
+SedonaDB is a single-node analytical database engine with geospatial as the 
first-class citizen.
+
+Highly performant and dependency-free, SedonaDB is ideal for working with 
smaller datasets located on local machines or cloud instances.

Review Comment:
   ```suggestion
   Fast and dependency-free, SedonaDB is ideal for working with smaller datasets located on local machines or cloud instances.
   ```



##########
docs/working-with-parquet-files.md:
##########
@@ -0,0 +1,80 @@
+# Working with Parquet Files
+
+To read a GeoPaquet or Parquet file, you must use the dedicated 
`sd.read_parquet()` method. You cannot query a file path directly within the 
`sd.sql()` `FROM` clause.
+
+The `sd.sql()` function is designed to query tables that have already been 
registered in the session.
+
+## Install SedonaDB
+
+Use pip to install SedonaDB from the Python Package Index (PyPI).
+
+
+```python
+%pip install "apache-sedona[db]"
+```

Review Comment:
   Does this need to be replicated?



##########
docs/working-with-parquet-files.md:
##########
@@ -0,0 +1,80 @@
+# Working with Parquet Files
+
+To read a GeoPaquet or Parquet file, you must use the dedicated 
`sd.read_parquet()` method. You cannot query a file path directly within the 
`sd.sql()` `FROM` clause.
+
+The `sd.sql()` function is designed to query tables that have already been 
registered in the session.
+
+## Install SedonaDB
+
+Use pip to install SedonaDB from the Python Package Index (PyPI).
+
+
+```python
+%pip install "apache-sedona[db]"
+```
+
+## Implementation
+
+To read a geoparquet or parquet file with SedonaDB, you must:
+
+1. **Load** the Parquet file into a data frame using `sd.read_parquet()`.
+2. **Register** the data frame as a view with `to_view()`.
+3. **Query** the view using `sd.sql()`.
+4. **Write** to a Parquet file with `sd.to_parquet()`.

Review Comment:
   ```
   4. **Write** to a Parquet file with `.to_parquet()` or use `.to_pandas()` to export it to
       a GeoDataFrame.
   ```
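
For illustration, the four steps with the corrected fourth step could be wrapped up roughly as follows. This is only a sketch: `run_parquet_workflow` and its parameters are hypothetical names, `sd` is assumed to be a connection obtained from `sedona.db.connect()`, and `.to_pandas()` is assumed to behave as described in the comment above.

```python
def run_parquet_workflow(sd, source, view_name, output_path, limit=10):
    """Load, register, query, and export a (Geo)Parquet file.

    Hypothetical helper: `sd` is assumed to be a SedonaDB connection,
    e.g. the object returned by sedona.db.connect().
    """
    # 1. Load the Parquet file into a data frame
    df = sd.read_parquet(source)

    # 2. Register the data frame as a view
    df.to_view(view_name)

    # 3. Query the view (view_name is assumed to be a trusted identifier)
    result = sd.sql(f"SELECT * FROM {view_name} LIMIT {int(limit)}")

    # 4a. Write the result to a Parquet file (a method on the data frame,
    #     not on the connection)
    result.to_parquet(output_path)

    # 4b. Or export the result to pandas
    return result.to_pandas()
```

With a real connection this would be called as `run_parquet_workflow(sedona.db.connect(), path_or_url, "zone", "query_results.parquet")`.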



##########
README.md:
##########
@@ -27,7 +27,11 @@ SedonaDB only runs on a single machine, so it’s perfect for 
processing smaller
 
 ## Install
 
-You can install Python SedonaDB with `pip install apache-sedona[db]`.
+You can install Python SedonaDB with PyPi:

Review Comment:
   ```suggestion
   You can install Python SedonaDB with PyPI:
   ```



##########
docs/working-with-parquet-files.md:
##########
@@ -0,0 +1,80 @@
+# Working with Parquet Files
+
+To read a GeoPaquet or Parquet file, you must use the dedicated 
`sd.read_parquet()` method. You cannot query a file path directly within the 
`sd.sql()` `FROM` clause.
+
+The `sd.sql()` function is designed to query tables that have already been 
registered in the session.
+
+## Install SedonaDB
+
+Use pip to install SedonaDB from the Python Package Index (PyPI).
+
+
+```python
+%pip install "apache-sedona[db]"
+```
+
+## Implementation
+
+To read a geoparquet or parquet file with SedonaDB, you must:
+
+1. **Load** the Parquet file into a data frame using `sd.read_parquet()`.
+2. **Register** the data frame as a view with `to_view()`.
+3. **Query** the view using `sd.sql()`.
+4. **Write** to a Parquet file with `sd.to_parquet()`.
+
+
+```python
+# Import the sedona.db module and connect to SedonaDB
+import sedona.db
+sd = sedona.db.connect()
+```
+
+
+```python
+
+# 1. Load the Parquet file
+df = sd.read_parquet(
+    'https://raw.githubusercontent.com/geoarrow/geoarrow-data/v0.2.0/'
+    'natural-earth/files/natural-earth_cities_geo.parquet'
+)
+
+# 2. Register the data frame as a view
+df.to_view("zone")
+
+# 3. Query the view and store the result in a new DataFrame
+query_result_df = sd.sql("SELECT * FROM zone LIMIT 10")
+query_result_df.show()
+```

Review Comment:
   Normally this will display a result...was the notebook executed?



##########
docs/working-with-parquet-files.md:
##########
@@ -0,0 +1,80 @@
+# Working with Parquet Files
+
+To read a GeoPaquet or Parquet file, you must use the dedicated 
`sd.read_parquet()` method. You cannot query a file path directly within the 
`sd.sql()` `FROM` clause.
+
+The `sd.sql()` function is designed to query tables that have already been 
registered in the session.
+
+## Install SedonaDB
+
+Use pip to install SedonaDB from the Python Package Index (PyPI).
+
+
+```python
+%pip install "apache-sedona[db]"
+```
+
+## Implementation
+
+To read a geoparquet or parquet file with SedonaDB, you must:
+
+1. **Load** the Parquet file into a data frame using `sd.read_parquet()`.
+2. **Register** the data frame as a view with `to_view()`.
+3. **Query** the view using `sd.sql()`.
+4. **Write** to a Parquet file with `sd.to_parquet()`.
+
+
+```python
+# Import the sedona.db module and connect to SedonaDB
+import sedona.db
+sd = sedona.db.connect()
+```
+
+
+```python
+
+# 1. Load the Parquet file
+df = sd.read_parquet(
+    'https://raw.githubusercontent.com/geoarrow/geoarrow-data/v0.2.0/'
+    'natural-earth/files/natural-earth_cities_geo.parquet'
+)
+
+# 2. Register the data frame as a view
+df.to_view("zone")
+
+# 3. Query the view and store the result in a new DataFrame
+query_result_df = sd.sql("SELECT * FROM zone LIMIT 10")
+query_result_df.show()
+```
+
+
+```python
+
+# 4. Write the result to a new Parquet file
+output_path = "query_results.parquet"
+query_result_df.to_parquet(output_path)
+
+# (Optional) Verify the written file
+print(f"\nVerifying the written file at '{output_path}'...")
+verified_df = sd.read_parquet(output_path)
+verified_df.show(5)
+```
+
+### Common Errors
+
+Directly using a file path within `sd.sql()` is a common mistake that will 
result in an error.

Review Comment:
   I think you can remove this section. Querying a file path directly works, but
   it's harder to use than the workflow you've documented here:
   
   ```python
   import sedona.db
   
   sd = sedona.db.connect()
   sd.options.interactive = True
   
   sd.sql(
       "SELECT * FROM 'submodules/geoarrow-data/microsoft-buildings/files/microsoft-buildings_point_geo.parquet'"
   ).limit(5)
   #> ┌──────────────────────────────────────┐
   #> │               geometry               │
   #> │               geometry               │
   #> ╞══════════════════════════════════════╡
   #> │ POINT(-88.84280650000001 37.9056685) │
   #> ├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
   #> │ POINT(-88.84049522 37.90522245)      │
   #> ├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
   #> │ POINT(-88.84073500000001 37.9055005) │
   #> ├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
   #> │ POINT(-88.83995028 37.90585524)      │
   #> ├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
   #> │ POINT(-88.838466 37.9050765)         │
   #> └──────────────────────────────────────┘
   ```



##########
docs/working-with-parquet-files.md:
##########
@@ -0,0 +1,80 @@
+# Working with Parquet Files
+
+To read a GeoPaquet or Parquet file, you must use the dedicated 
`sd.read_parquet()` method. You cannot query a file path directly within the 
`sd.sql()` `FROM` clause.
+
+The `sd.sql()` function is designed to query tables that have already been 
registered in the session.
+
+## Install SedonaDB
+
+Use pip to install SedonaDB from the Python Package Index (PyPI).
+
+
+```python
+%pip install "apache-sedona[db]"
+```
+
+## Implementation
+
+To read a geoparquet or parquet file with SedonaDB, you must:

Review Comment:
   ```
   A common workflow for working with GeoParquet and/or Parquet files is:
   ```



##########
docs/index.md:
##########
@@ -24,30 +23,45 @@ title: Introducing SedonaDB
   under the License.
 -->
 
-SedonaDB is a high-performance, dependency-free geospatial compute engine 
designed for single-node processing, making it ideal for smaller datasets on 
local machines or cloud instances.
+SedonaDB is a single-node analytical database engine with geospatial as the 
first-class citizen.
+
+Highly performant and dependency-free, SedonaDB is ideal for working with 
smaller datasets located on local machines or cloud instances.
 
 The initial `0.1` release supports a core set of vector operations, with 
comprehensive vector and raster computation capabilities planned for the near 
future.
 
+For massive, distributed workloads, you can still leverage the power of 
SedonaSpark, SedonaFlink, or SedonaSnow.

Review Comment:
   ```suggestion
   For distributed workloads, you can still leverage the power of SedonaSpark, SedonaFlink, or SedonaSnow.
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
