paleolimbot commented on code in PR #110:
URL: https://github.com/apache/sedona-db/pull/110#discussion_r2361479840

##########
docs/working-with-parquet-files.md:
##########
@@ -0,0 +1,80 @@
+# Working with Parquet Files
+
+To read a GeoPaquet or Parquet file, you must use the dedicated `sd.read_parquet()` method. You cannot query a file path directly within the `sd.sql()` `FROM` clause.
+
+The `sd.sql()` function is designed to query tables that have already been registered in the session.

Review Comment:
   Putting them in the FROM clause should work! How about:

   ```
   The easiest way to read a GeoParquet or Parquet file is to use `sd.read_parquet()`; however, you can also query GeoParquet or Parquet files by referring to their path from SQL.
   ```

##########
docs/working-with-parquet-files.md:
##########
@@ -0,0 +1,80 @@
+# Working with Parquet Files
+
+To read a GeoPaquet or Parquet file, you must use the dedicated `sd.read_parquet()` method. You cannot query a file path directly within the `sd.sql()` `FROM` clause.
+
+The `sd.sql()` function is designed to query tables that have already been registered in the session.
+
+## Install SedonaDB
+
+Use pip to install SedonaDB from the Python Package Index (PyPI).
+
+
+```python
+%pip install "apache-sedona[db]"
+```
+
+## Implementation
+
+To read a geoparquet or parquet file with SedonaDB, you must:
+
+1. **Load** the Parquet file into a data frame using `sd.read_parquet()`.
+2. **Register** the data frame as a view with `to_view()`.
+3. **Query** the view using `sd.sql()`.
+4. **Write** to a Parquet file with `sd.to_parquet()`.
+
+
+```python
+# Import the sedona.db module and connect to SedonaDB
+import sedona.db
+sd = sedona.db.connect()
+```
+
+
+```python
+
+# 1. Load the Parquet file
+df = sd.read_parquet(
+    'https://raw.githubusercontent.com/geoarrow/geoarrow-data/v0.2.0/'
+    'natural-earth/files/natural-earth_cities_geo.parquet'
+)
+
+# 2. Register the data frame as a view
+df.to_view("zone")
+
+# 3. Query the view and store the result in a new DataFrame
+query_result_df = sd.sql("SELECT * FROM zone LIMIT 10")
+query_result_df.show()
+```
+
+
+```python
+
+# 4. Write the result to a new Parquet file
+output_path = "query_results.parquet"
+query_result_df.to_parquet(output_path)
+
+# (Optional) Verify the written file
+print(f"\nVerifying the written file at '{output_path}'...")
+verified_df = sd.read_parquet(output_path)
+verified_df.show(5)

Review Comment:
   I'm not sure these lines are adding to the tutorial

##########
requirements-docs.txt:
##########

Review Comment:
   I believe this will break CI (and the code you've documented above)

##########
docs/index.md:
##########
@@ -24,30 +23,45 @@ title: Introducing SedonaDB
 under the License.
 -->
-SedonaDB is a high-performance, dependency-free geospatial compute engine designed for single-node processing, making it ideal for smaller datasets on local machines or cloud instances.
+SedonaDB is a single-node analytical database engine with geospatial as the first-class citizen.
+
+Highly performant and dependency-free, SedonaDB is ideal for working with smaller datasets located on local machines or cloud instances.

Review Comment:
   ```suggestion
   Fast and dependency-free, SedonaDB is ideal for working with smaller datasets located on local machines or cloud instances.
   ```

##########
docs/working-with-parquet-files.md:
##########
@@ -0,0 +1,80 @@
+# Working with Parquet Files
+
+To read a GeoPaquet or Parquet file, you must use the dedicated `sd.read_parquet()` method. You cannot query a file path directly within the `sd.sql()` `FROM` clause.
+
+The `sd.sql()` function is designed to query tables that have already been registered in the session.
+
+## Install SedonaDB
+
+Use pip to install SedonaDB from the Python Package Index (PyPI).
+
+
+```python
+%pip install "apache-sedona[db]"
+```

Review Comment:
   Does this need to be replicated?

##########
docs/working-with-parquet-files.md:
##########
@@ -0,0 +1,80 @@
+# Working with Parquet Files
+
+To read a GeoPaquet or Parquet file, you must use the dedicated `sd.read_parquet()` method. You cannot query a file path directly within the `sd.sql()` `FROM` clause.
+
+The `sd.sql()` function is designed to query tables that have already been registered in the session.
+
+## Install SedonaDB
+
+Use pip to install SedonaDB from the Python Package Index (PyPI).
+
+
+```python
+%pip install "apache-sedona[db]"
+```
+
+## Implementation
+
+To read a geoparquet or parquet file with SedonaDB, you must:
+
+1. **Load** the Parquet file into a data frame using `sd.read_parquet()`.
+2. **Register** the data frame as a view with `to_view()`.
+3. **Query** the view using `sd.sql()`.
+4. **Write** to a Parquet file with `sd.to_parquet()`.

Review Comment:
   ```
   4. **Write** to a Parquet file with `.to_parquet()` or use `.to_pandas()` to export it to a GeoDataFrame.
   ```

##########
README.md:
##########
@@ -27,7 +27,11 @@ SedonaDB only runs on a single machine, so it’s perfect for processing smaller
 ## Install
-You can install Python SedonaDB with `pip install apache-sedona[db]`.
+You can install Python SedonaDB with PyPi:

Review Comment:
   ```suggestion
   You can install Python SedonaDB with PyPI:
   ```

##########
docs/working-with-parquet-files.md:
##########
@@ -0,0 +1,80 @@
+# Working with Parquet Files
+
+To read a GeoPaquet or Parquet file, you must use the dedicated `sd.read_parquet()` method. You cannot query a file path directly within the `sd.sql()` `FROM` clause.
+
+The `sd.sql()` function is designed to query tables that have already been registered in the session.
+
+## Install SedonaDB
+
+Use pip to install SedonaDB from the Python Package Index (PyPI).
+
+
+```python
+%pip install "apache-sedona[db]"
+```
+
+## Implementation
+
+To read a geoparquet or parquet file with SedonaDB, you must:
+
+1. **Load** the Parquet file into a data frame using `sd.read_parquet()`.
+2. **Register** the data frame as a view with `to_view()`.
+3. **Query** the view using `sd.sql()`.
+4. **Write** to a Parquet file with `sd.to_parquet()`.
+
+
+```python
+# Import the sedona.db module and connect to SedonaDB
+import sedona.db
+sd = sedona.db.connect()
+```
+
+
+```python
+
+# 1. Load the Parquet file
+df = sd.read_parquet(
+    'https://raw.githubusercontent.com/geoarrow/geoarrow-data/v0.2.0/'
+    'natural-earth/files/natural-earth_cities_geo.parquet'
+)
+
+# 2. Register the data frame as a view
+df.to_view("zone")
+
+# 3. Query the view and store the result in a new DataFrame
+query_result_df = sd.sql("SELECT * FROM zone LIMIT 10")
+query_result_df.show()
+```

Review Comment:
   Normally this will display a result...was the notebook executed?

##########
docs/working-with-parquet-files.md:
##########
@@ -0,0 +1,80 @@
+# Working with Parquet Files
+
+To read a GeoPaquet or Parquet file, you must use the dedicated `sd.read_parquet()` method. You cannot query a file path directly within the `sd.sql()` `FROM` clause.
+
+The `sd.sql()` function is designed to query tables that have already been registered in the session.
+
+## Install SedonaDB
+
+Use pip to install SedonaDB from the Python Package Index (PyPI).
+
+
+```python
+%pip install "apache-sedona[db]"
+```
+
+## Implementation
+
+To read a geoparquet or parquet file with SedonaDB, you must:
+
+1. **Load** the Parquet file into a data frame using `sd.read_parquet()`.
+2. **Register** the data frame as a view with `to_view()`.
+3. **Query** the view using `sd.sql()`.
+4. **Write** to a Parquet file with `sd.to_parquet()`.
+
+
+```python
+# Import the sedona.db module and connect to SedonaDB
+import sedona.db
+sd = sedona.db.connect()
+```
+
+
+```python
+
+# 1. Load the Parquet file
+df = sd.read_parquet(
+    'https://raw.githubusercontent.com/geoarrow/geoarrow-data/v0.2.0/'
+    'natural-earth/files/natural-earth_cities_geo.parquet'
+)
+
+# 2. Register the data frame as a view
+df.to_view("zone")
+
+# 3. Query the view and store the result in a new DataFrame
+query_result_df = sd.sql("SELECT * FROM zone LIMIT 10")
+query_result_df.show()
+```
+
+
+```python
+
+# 4. Write the result to a new Parquet file
+output_path = "query_results.parquet"
+query_result_df.to_parquet(output_path)
+
+# (Optional) Verify the written file
+print(f"\nVerifying the written file at '{output_path}'...")
+verified_df = sd.read_parquet(output_path)
+verified_df.show(5)
+```
+
+### Common Errors
+
+Directly using a file path within `sd.sql()` is a common mistake that will result in an error.

Review Comment:
   I think you can remove this section. It works, but it's harder to use than the workflow you've documented here:

   ```python
   import sedona.db

   sd = sedona.db.connect()
   sd.options.interactive = True
   sd.sql(
       "SELECT * FROM 'submodules/geoarrow-data/microsoft-buildings/files/microsoft-buildings_point_geo.parquet'"
   ).limit(5)
   #> ┌──────────────────────────────────────┐
   #> │               geometry               │
   #> │               geometry               │
   #> ╞══════════════════════════════════════╡
   #> │ POINT(-88.84280650000001 37.9056685) │
   #> ├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
   #> │ POINT(-88.84049522 37.90522245)      │
   #> ├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
   #> │ POINT(-88.84073500000001 37.9055005) │
   #> ├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
   #> │ POINT(-88.83995028 37.90585524)      │
   #> ├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
   #> │ POINT(-88.838466 37.9050765)         │
   #> └──────────────────────────────────────┘
   ```

##########
docs/working-with-parquet-files.md:
##########
@@ -0,0 +1,80 @@
+# Working with Parquet Files
+
+To read a GeoPaquet or Parquet file, you must use the dedicated `sd.read_parquet()` method. You cannot query a file path directly within the `sd.sql()` `FROM` clause.
+
+The `sd.sql()` function is designed to query tables that have already been registered in the session.
+
+## Install SedonaDB
+
+Use pip to install SedonaDB from the Python Package Index (PyPI).
+
+
+```python
+%pip install "apache-sedona[db]"
+```
+
+## Implementation
+
+To read a geoparquet or parquet file with SedonaDB, you must:

Review Comment:
   ```
   A common workflow for working with GeoParquet and/or Parquet files is:
   ```

##########
docs/index.md:
##########
@@ -24,30 +23,45 @@ title: Introducing SedonaDB
 under the License.
 -->
-SedonaDB is a high-performance, dependency-free geospatial compute engine designed for single-node processing, making it ideal for smaller datasets on local machines or cloud instances.
+SedonaDB is a single-node analytical database engine with geospatial as the first-class citizen.
+
+Highly performant and dependency-free, SedonaDB is ideal for working with smaller datasets located on local machines or cloud instances.
 The initial `0.1` release supports a core set of vector operations, with comprehensive vector and raster computation capabilities planned for the near future.
+For massive, distributed workloads, you can still leverage the power of SedonaSpark, SedonaFlink, or SedonaSnow.

Review Comment:
   ```suggestion
   For distributed workloads, you can still leverage the power of SedonaSpark, SedonaFlink, or SedonaSnow.
   ```

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
